Search Results: "Julien Danjou"

13 August 2015

Julien Danjou: Reading LWN.net with Pocket

I started using Pocket a few months ago to store my backlog of things to read. It's especially useful for reading content offline, since we still don't have Internet access in places such as airplanes or the Paris metro. It's only 2015, after all. I have also been an LWN.net subscriber for years now, and I really like the articles from the weekly edition. Unfortunately, as access is restricted to subscribers, you need to log in, which makes it impossible to add these articles to Pocket directly. Sad. Yesterday, I thought about that and decided to start hacking on it. LWN provides a feature called "Subscriber Link" that allows you to share an article with a friend. I managed to use that feature to share the articles with my friend Pocket! As doing that every week is tedious, I wrote a small Python program called lwn2pocket that I published on GitHub. Feel free to use it, hack it and send pull requests.

4 August 2015

Julien Danjou: Ceilometer, Gnocchi & Aodh: Liberty progress

It's been a while since I talked about Ceilometer and its companions, so I thought I'd write a bit about what's going on on this side of OpenStack. I'm not going to cover new features and fancy stuff today, but rather give a shallow overview of the new project processes we initiated.
Ceilometer growing
Ceilometer has grown a lot since we started it 3 years ago. It has evolved from a system designed to fetch and store measurements to a more complex system, with agents, alarms, events, databases, APIs, etc. All those features were needed and asked for by users and operators, but let's be honest: some of them should never have ended up in the Ceilometer code repository, especially not all at the same time. The reality is that we picked a pragmatic approach because of the rigidity of the OpenStack Technical Committee regarding how new projects become OpenStack integrated and, therefore, blessed projects. Ceilometer was actually the first project to be incubated and then integrated, so we had to go through the very first issues of that process. Fortunately, time has passed and all those constraints have been relaxed. To me, the OpenStack Foundation is turning into something that looks like the Apache Foundation, and there is, therefore, no need to tie technical solutions to political issues. Indeed, the Big Tent now allows much more flexibility in all of that. A year ago, we were afraid to bring Gnocchi into Ceilometer. Was the Technical Committee going to review the project? Was the project going to be in the scope of Ceilometer for the Technical Committee? Now we don't have to ask ourselves those questions. That freedom empowers us to do what we think is good in terms of technical design, without worrying too much about political issues.
Ceilometer development activity
Acknowledging Gnocchi
The first step in this new process was to continue working on Gnocchi (a time-series database and resource indexer designed to overcome the historical Ceilometer storage issues) and to decide that merging it into Ceilometer as some REST API v3 was not the right call: it was better to keep it standalone. We managed to get traction for Gnocchi, gaining a few contributors and users. We're even seeing talks proposed for the next Tokyo Summit where people leverage Gnocchi, such as "Service of predictive analytics on cost and performance in OpenStack", "Suveil" and "Cutting Edge NFV On OpenStack: Healing and Scaling Distributed Applications". We are also making some progress on pushing Gnocchi outside of the OpenStack community, as it can be a self-sufficient time-series and resource database used without any OpenStack interaction.
Branching Aodh
Rather than continuing to grow Ceilometer, during the last summit we all decided that it was time to reorganize and split Ceilometer into the different components it is made of, leveraging a more service-oriented architecture. The alarm subsystem of Ceilometer being mostly untied to the rest of Ceilometer, we decided it was the first and perfect candidate for that. I personally took on the work and created a new repository with only the alarm code from Ceilometer, named Aodh.
Aodh is an Irish word meaning fire. The word was picked because it has some relation to Heat, and because we have some Irish influence around the project.
This made sense for a lot of reasons. First, because Aodh can now work completely standalone, using either Ceilometer or Gnocchi as a backend, or any new plugin you'd write. I love the idea that OpenStack projects can work standalone, like Swift does for example, without implying any other OpenStack component. I think it's a proof of good design. Secondly, because it allows us to reason about a smaller chunk of software, a benefit that is really under-estimated in OpenStack today. I believe that the size of your software should match a certain ratio to the size of your team. Aodh is, therefore, a new project under the OpenStack Telemetry program (or what remains of OpenStack programs now), alongside Ceilometer and Gnocchi, forked from the original Ceilometer alarm feature. We'll deprecate the latter with the Liberty release, and we'll remove it in the Mitaka release.
Lessons learned
Moving that code out of Ceilometer (in the case of Aodh), or not merging it in (in the case of Gnocchi), had a few side effects that I admit we probably under-estimated back then. The code size of Gnocchi or Aodh ended up being much smaller than the entire Ceilometer project (Gnocchi is 7× smaller and Aodh 5× smaller than Ceilometer) and therefore much easier to manipulate and hack on. That allowed us to merge dozens of patches in a few weeks, cleaning up and enhancing a lot of small things in the code. Those tasks are much harder in Ceilometer, due to the bigger size of the code base and the small size of our team. Having our small team work on smaller chunks of changes, even when it meant actually doing more reviews, greatly improved our general velocity and the number of bugs fixed and features implemented. On the more sociological side, I think it gave the team the sensation of finally owning the project. Ceilometer was huge, and it was impossible for people to know every side of it. Now, it's possible for people inside a team to cover a much larger portion of those smaller projects, which gives them a greater sense of ownership and caring, which ends up being good for the overall project quality. It also means that we decided to have different core teams per project (Ceilometer, Gnocchi, and Aodh), as they all serve different purposes and can all be used standalone or with each other, meaning we can have contributors completely ignoring the other projects. All of that reminds me of discussions I've heard about projects such as Glance trying to fit in new features, some of which are really orthogonal to the original purpose. It's now clear to me that having different small components interacting together, each of which can be completely owned and taken care of by a (small) team of contributors, is the way to go. People who can trust each other and easily bring new people in make a project incredibly more powerful. Having a project cover too wide a set of features makes things more difficult if you don't have enough manpower. This is clearly an issue that big projects inside OpenStack, such as Neutron or Nova, are facing now.

16 June 2015

Julien Danjou: Timezones and Python

Recently, I've been fighting with the never-ending issue of timezones. I never thought I would plunge into this rabbit hole, but hacking on OpenStack and Gnocchi, I fell into that trap easily, thanks to Python.
Why you really, really, should never ever deal with timezones
To get a glimpse of the complexity of timezones, I recommend that you watch Tom Scott's video on the subject. It's fun and it summarizes remarkably well the nightmare that timezones are and why you should stop thinking that you're smart.
The importance of timezones in applications
Once you've heard what Tom says, I think it gets pretty clear that a timestamp without any timezone attached does not give any useful information. It should be considered irrelevant and useless. Without the necessary context given by the timezone, you cannot infer what point in time your application is really referring to. That means your application should never handle timestamps with no timezone information: it should either try to guess the timezone, or raise an error if none is provided in the input. Of course, you can decide that a missing timezone means UTC. This sounds very handy, but can also be dangerous in certain applications or languages such as Python, as we'll see. Indeed, in certain applications, converting timestamps to UTC and losing the timezone information is a terrible idea. Imagine that a user creates a recurring event every Wednesday at 10:00 in their local timezone, say CET. If you convert that to UTC, the event will end up being stored as every Wednesday at 09:00. Now imagine that the CET timezone switches from UTC+01:00 to UTC+02:00: your application will compute that the event starts at 11:00 CET every Wednesday. Which is wrong, because as the user told you, the event starts at 10:00 CET, whatever the definition of CET is. Not at 11:00 CET. So CET means CET, not necessarily UTC+1. As for endpoints like REST APIs, a thing I deal with daily, all timestamps should include timezone information. It's nearly impossible to know what timezone the timestamps are in otherwise: UTC? Server local? User local? No way to know.
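You can see this offset drift directly with pytz (a minimal sketch; the timezone name and dates are illustrative):

import datetime
import pytz

paris = pytz.timezone("Europe/Paris")
# A Wednesday 10:00 local event in winter: UTC offset is +01:00
winter = paris.localize(datetime.datetime(2015, 1, 7, 10, 0))
print(winter.utcoffset())  # 1:00:00
# The "same" weekly event in summer: UTC offset is now +02:00
summer = paris.localize(datetime.datetime(2015, 7, 8, 10, 0))
print(summer.utcoffset())  # 2:00:00
# Replaying the stored winter instant (09:00 UTC) every week would
# display 11:00 local time once DST starts, not the 10:00 the user asked for.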
Python design & defect
Python comes with a timestamp object named datetime.datetime. It can store date and time precise to the microsecond, and is qualified as timezone "aware" or "unaware", depending on whether it embeds timezone information. To build such an object based on the current time, one can use datetime.datetime.utcnow() to retrieve the date and time for the UTC timezone, and datetime.datetime.now() to retrieve the date and time for the current timezone, whatever it is.
>>> import datetime
>>> datetime.datetime.utcnow()
datetime.datetime(2015, 6, 15, 13, 24, 48, 27631)
>>> datetime.datetime.now()
datetime.datetime(2015, 6, 15, 15, 24, 52, 276161)

As you can notice, none of these results contains timezone information. Indeed, the Python datetime API always returns unaware datetime objects, which is very unfortunate: as soon as you get one of these objects, there is no way to know what the timezone is, so these objects are pretty "useless" on their own. Armin Ronacher proposes that an application always treat the unaware datetime objects from Python as UTC. As we just saw, that statement cannot be considered true for objects returned by datetime.datetime.now(), so I would not advise doing so. datetime objects with no timezone should be considered a "bug" in the application.
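At least mixing unaware and aware objects does not fail silently; Python refuses to compare them outright (a quick REPL sketch):

>>> import datetime
>>> import pytz
>>> datetime.datetime.utcnow() < datetime.datetime.now(tz=pytz.utc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't compare offset-naive and offset-aware datetimes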
Recommendations
My recommendation list comes down to:
  1. Always use aware datetime objects, i.e. with timezone information. That makes sure you can compare them directly (aware and unaware datetime objects are not comparable) and will return them correctly to users. Leverage pytz to get timezone objects.
  2. Use ISO 8601 as the input and output string format. Use datetime.datetime.isoformat() to return timestamps as strings formatted using that format, which includes the timezone information.
In Python, that's equivalent to having:
>>> import datetime
>>> import pytz
>>> def utcnow():
...     return datetime.datetime.now(tz=pytz.utc)
>>> utcnow()
datetime.datetime(2015, 6, 15, 14, 45, 19, 182703, tzinfo=<UTC>)
>>> utcnow().isoformat()
'2015-06-15T14:45:21.982600+00:00'

If you need to parse strings containing ISO 8601 formatted timestamps, you can rely on the iso8601 module, which returns timestamps with correct timezone information. This makes timestamps directly comparable:
>>> import iso8601
>>> iso8601.parse_date(utcnow().isoformat())
datetime.datetime(2015, 6, 15, 14, 46, 43, 945813, tzinfo=<FixedOffset '+00:00' datetime.timedelta(0)>)
>>> iso8601.parse_date(utcnow().isoformat()) < utcnow()
True
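If you need to normalize an aware timestamp to UTC, e.g. before handing it to a store that assumes UTC, astimezone() does the conversion (a quick REPL sketch; the timezone is illustrative):

>>> import datetime
>>> import pytz
>>> paris = pytz.timezone("Europe/Paris")
>>> local = paris.localize(datetime.datetime(2015, 6, 15, 16, 45))
>>> local.astimezone(pytz.utc)
datetime.datetime(2015, 6, 15, 14, 45, tzinfo=<UTC>)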

If you need to store those timestamps, the same rule should apply. If you rely on MongoDB, it assumes that all the timestamps are in UTC, so be careful when storing them: you will have to normalize the timestamps to UTC. For MySQL, nothing is assumed; it's up to the application to insert them in a timezone that makes sense to it. Obviously, if you have multiple applications accessing the same database with different data sources, this can end up being a nightmare. PostgreSQL has a recommended special data type called timestamp with timezone, which can store the associated timezone and do all the computation for you. That's obviously the recommended way to store them. That does not mean you should not use UTC in most cases; it just means you can be sure of the timezone the timestamps were stored in once they are written to the database, and you can check whether any other application inserted timestamps with a different timezone.
OpenStack status
As a side note, I've recently improved the OpenStack situation by changing the oslo.utils.timeutils module to deprecate some useless and dangerous functions. I've also added support for returning timezone-aware objects when using the oslo_utils.timeutils.utcnow() function. Unfortunately it's not possible to make that the default, for backward compatibility reasons, but it's there nevertheless, and it's advised to use it. Thanks to my colleague Victor for the help! Have a nice day, whatever your timezone is!

2 June 2015

Julien Danjou: Get back up and try again: retrying in Python

I don't often write about the tools I use for my daily software development tasks. I recently realized that I really should start to share my workflows and weapons of choice more often. One thing that I have a hard time enduring while doing Python code reviews is people writing utility code that is not directly tied to the core of their business. This looks to me like wasted time maintaining code that should be reused from elsewhere. So today I'd like to start with retrying, a Python package that you can use to retry anything.
It's OK to fail
Often in computing, you have to deal with external resources. That means accessing resources you don't control. Resources that can fail, become flapping, unreachable or unavailable. Most applications don't deal with that at all, and explode in flight, leaving a skeptical user in front of the computer. A lot of software engineers refuse to deal with failure, and don't bother handling this kind of scenario in their code. In the best case, applications usually just handle the case where the external system is out of order: they log something, and inform the user that they should try again later. In this cloud computing era, we tend to design software components with service-oriented architecture in mind. That means having a lot of different services talking to each other over the network. And we all know that networks tend to fail, and distributed systems too. Writing software that treats failure as part of normal operation is a terrific idea.
Retrying
In order to help applications handle these potential failures, you need a plan. Leaving the user with the burden to "try again later" is rarely a good choice. Therefore, most of the time you want your application to retry. Retrying an action is a full strategy on its own, with a lot of options. You can retry only on certain conditions, and with the number of tries based on time (e.g. every second), based on a number of attempts (e.g. retry 3 times and abort), based on the problem encountered, or even on all of those. For all of that, I use the retrying library, which you can retrieve easily on PyPI. retrying provides a decorator called retry that you can use on top of any function or method in Python to make it retry in case of failure. By default, retry calls your function endlessly until it returns a value rather than raising an error.
import random
from retrying import retry

@retry
def pick_one():
    if random.randint(0, 10) != 1:
        raise Exception("1 was not picked")

This will execute the function pick_one until random.randint returns 1. retry accepts a few arguments, such as the minimum and maximum delays to use, which can also be randomized. Randomizing the delay is a good strategy to avoid detectable patterns or congestion. Moreover, it supports exponential delay, which can be used to implement exponential backoff, a good solution for retrying tasks while really avoiding congestion. It's especially handy for background tasks.
@retry(wait_exponential_multiplier=1000, wait_exponential_max=10000)
def wait_exponential_1000():
    print "Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards"
    raise Exception("Retry!")

You can mix that with a maximum delay, which can give you a good strategy to retry for a while, and then fail anyway:
# Stop retrying after 30 seconds anyway
>>> @retry(wait_exponential_multiplier=1000, wait_exponential_max=10000, stop_max_delay=30000)
... def wait_exponential_1000():
...     print "Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards"
...     raise Exception("Retry!")
...
>>> wait_exponential_1000()
Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards
Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards
Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards
Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards
Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards
Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/local/lib/python2.7/site-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/usr/local/lib/python2.7/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/local/lib/python2.7/site-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "<stdin>", line 4, in wait_exponential_1000
Exception: Retry!

A pattern I use very often is the ability to retry only based on some exception type. You can specify a function to filter out the exceptions you want to ignore from the ones that should trigger a retry.
def retry_on_ioerror(exc):
    return isinstance(exc, IOError)

@retry(retry_on_exception=retry_on_ioerror)
def read_file():
    with open("myfile", "r") as f:
        return f.read()

retry will call the function passed as retry_on_exception with the raised exception as its first argument. It's up to the function to return a boolean indicating whether a retry should be performed. In the example above, this will only retry reading the file if an IOError occurs; if any other exception type is raised, no retry is performed. The same pattern can be implemented using the keyword argument retry_on_result, where you provide a function that analyses the result and retries based on it.
def retry_if_file_empty(result):
    return len(result) <= 0

@retry(retry_on_result=retry_if_file_empty)
def read_file():
    with open("myfile", "r") as f:
        return f.read()

This example will read the file until it stops being empty. If the file does not exist, an IOError is raised, and the default behavior, which triggers a retry on all exceptions, kicks in; the retry is therefore performed. That's it! retrying is really a good and small library that you should leverage rather than implementing your own half-baked solution!
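As a closing sketch, the options above can be combined. The snippet below also uses stop_max_attempt_number, another retrying option that caps the number of attempts; flaky_call is a hypothetical stand-in for any unreliable external call:

import random
from retrying import retry

def retry_on_ioerror(exc):
    return isinstance(exc, IOError)

@retry(retry_on_exception=retry_on_ioerror,
       wait_exponential_multiplier=1000,
       wait_exponential_max=10000,
       stop_max_attempt_number=5)
def flaky_call():
    # Placeholder for a call to an external resource that may fail
    if random.random() < 0.8:
        raise IOError("resource unavailable")
    return "ok"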

26 May 2015

Julien Danjou: OpenStack Summit Liberty from a Ceilometer & Gnocchi point of view

Last week I was in Vancouver, BC for the OpenStack Summit, discussing the new Liberty version that will be released in 6 months. I attended the summit mainly to discuss and follow up on new developments in Ceilometer, Gnocchi and Oslo. It has been a pretty good week, and we were able to discuss and plan a few interesting things.
Ops feedback
We had half a dozen Ceilometer sessions, and the first one was dedicated to getting feedback from operators using Ceilometer. We had a few operators present, and a few of the Ceilometer team. We had a constructive discussion, and my feeling is that operators struggle with 2 things so far: scaling Ceilometer storage, and having Ceilometer not kill the rest of OpenStack. We discussed the first point as being addressed by Gnocchi, and I presented Gnocchi itself a bit, as well as how and why it will fix the storage scalability issues operators have encountered so far. Ceilometer bringing down the OpenStack installation is a more interesting problem. Ceilometer pollsters request information from Nova and Glance to gather statistics. Until Kilo, Ceilometer used to do that regularly and at a fixed interval, causing high peak load in OpenStack. With the introduction of jitter in Kilo, this should be less of a problem. However, Ceilometer hits various endpoints in OpenStack that are poorly designed, and hitting those endpoints of Nova or other components triggers a lot of load on the platform. Unfortunately, this makes operators blame Ceilometer rather than the components guilty of poor design. We'd like to push forward improving these components, but it's probably going to take a long time.
Componentisation
When I started the Gnocchi project last year, I pretty soon realized that we would be able to split Ceilometer itself into different smaller components that could work independently, while being able to leverage each other. For example, Gnocchi can run standalone and store your metrics even if you don't use Ceilometer, nor even OpenStack itself. My fellow developer Chris Dent had the same idea about splitting Ceilometer a few months ago and drafted a proposal. The idea is to have Ceilometer split into different parts that people could assemble together or run on their own. Interestingly enough, we had three 40-minute sessions planned to talk and debate about this division of Ceilometer, though we all agreed within 5 minutes that it was the right thing to do. Five more minutes later, we agreed on which parts to split out. The rest of the time was allocated to discussing various details of that split, and I committed to starting the work with the Ceilometer alarming subsystem. I wrote a specification on the plane bringing me to Vancouver, which should be approved pretty soon now. I have already started the implementation work. So fingers crossed, Ceilometer should have a new component in Liberty handling alarming on its own. This would allow users, for example, to deploy only Gnocchi and Ceilometer alarm. They would be able to feed data to Gnocchi using their own system, and build alarms using the Ceilometer alarm subsystem relying on Gnocchi's data.
Gnocchi
We didn't have a dedicated Gnocchi slot, mainly because I indicated I didn't feel we needed one. We discussed a few points around coffee anyway, and I've been able to draw up a few new ideas and changes I'd like to see in Gnocchi: mainly changing the API contract to be more asynchronous so we can support InfluxDB more correctly, and improving the drivers based on Carbonara (the library we created to manipulate timeseries) to be faster. All of that, plus a few Oslo tasks I'd like to tackle, should keep me busy for the next cycle!

11 May 2015

Julien Danjou: My interview about software tests and Python

I've recently been contacted by Johannes Hubertz, who is writing a new book about Python in German called "Softwaretests mit Python", which will be published by Open Source Press, Munich this summer. His book will feature some interviews, and he was kind enough to let me write a bit about software testing. This is the interview that I gave for his book. Johannes translated it to German and it will be included in his book, and I decided to publish it on my blog today. Following is the original version.
How did you come to Python?
I don't recall exactly, but around ten years ago, I saw more and more people using it and decided to take a look. Back then, I was more used to Perl. I didn't really like Perl and was not getting a good grip on its object system. As soon as I found an idea to work on (if I remember correctly, that was rebuildd), I started to code in Python, learning the language at the same time. I liked how Python worked, and how fast I was able to develop and learn with it, so I decided to keep using it for my next projects. I ended up diving into Python core for various reasons, even doing things like briefly hacking on projects like Cython at some point, and finally ended up working on OpenStack. OpenStack is a cloud computing platform entirely written in Python, so I've been writing Python every day since I started working on it. That's what pushed me to write The Hacker's Guide to Python in 2013 and then self-publish it a year later in 2014, a book where I talk about doing smart and efficient Python. It has been a great success; it has even been translated into Chinese and Korean, so I'm currently working on a second edition of the book. It has been an amazing adventure!
Zen of Python: Which line is the most important for you and why?
I like "There should be one and preferably only one obvious way to do it". The opposite is probably something that scared me in languages like Perl. Having one obvious way to do it is something I tend to like in functional languages like Lisp, which are, in my humble opinion, even better at that.
For a Python newbie, what are the most difficult subjects in Python?
I haven't been a newbie for a while, so it's hard for me to say. I don't think the language is hard to learn. There are some subtleties in the language itself when you dive deeply into the internals, but for beginners most of the concepts are pretty straightforward. If I had to pick, among the language basics, the most difficult thing would be around generator objects (yield). Nowadays I think the most difficult subject for newcomers is what version of Python to use, which libraries to rely on, and how to package and distribute projects. Things are getting better, fortunately.
When did you start using Test Driven Development and why?
I learned unit testing and TDD at school, where teachers forced me to learn Java, and I hated it. The frameworks looked complicated, and I had the impression I was losing my time. Which I actually was, since I was writing disposable programs; that's the only thing you do at school. Years later, when I started to write real and bigger programs (e.g. rebuildd), I quickly ended up fixing bugs I had already fixed. That reminded me of unit tests, and that it might be a good idea to start using them to stop fixing the same things over and over again. For a few years, I wrote less Python and more C and Lua code (for the awesome window manager), and I didn't use any testing. I probably lost hundreds of hours testing manually and fixing regressions; that was a good lesson. Though I had a good excuse at that time: it is (or at least was) way harder to do testing in C/Lua than in Python. Since that period, I have never stopped writing "tests". When I started to hack on OpenStack, the project was adopting a "no test? no merge!" policy due to the high number of regressions it had during the first releases. I honestly don't think I could work on any project that does not have at least minimal test coverage. It's impossible to hack efficiently on a code base that you're not able to test with just a simple command. It's also a real problem for newcomers in the open source world. When there are no tests, you can hack something, send a patch, and get a "you broke this" in response. Nowadays, this kind of response sounds unacceptable to me: if there is no test, then I didn't break anything! In the end, it's just too frustrating to work on untested projects, as I demonstrated in my study of the whisper source code.
What do you think are the most often seen pitfalls of TDD and how can they best be avoided?
The biggest problems are when and at what rate to write tests. On one hand, some people start to write overly precise tests way too soon. Doing that slows you down, especially when you are prototyping some idea or concept you just had. That does not mean that you should not write tests at all, but you should probably start with light coverage, until you are pretty sure that you're not going to rip everything out and start over. On the other hand, some people postpone writing tests forever, and end up with no tests at all, or a too-thin layer of tests, which leaves the project with pretty low coverage. Basically, your test coverage should reflect the state of your project. If it's just starting, you should build a thin layer of tests so you can hack on it easily and remodel it if needed. The more your project grows, the more you should make it solid and lay down more tests. Having too-detailed tests makes it painful to evolve the project at the start. Having too few tests in a big project makes it painful to maintain.
Do you think TDD fits and scales well for big projects like OpenStack?
Not only do I think it fits and scales well, but I also think it's just impossible not to use TDD in such big projects. When unit and functional test coverage was weak in OpenStack at its beginning, it was just impossible to fix a bug or write a new feature without breaking a lot of things, often without even noticing. We would release version N, and a ton of old bugs present in N-2 but fixed in N-1 were reopened. For big projects, with a lot of different use cases, configuration options, etc., you need belt and braces. You cannot throw code into a repository assuming it's going to keep working forever, and you can't afford to test everything manually at each commit. That's just insane.

4 May 2015

Julien Danjou: The Hacker's Guide to Python, 2nd edition!

A year has passed since the first release of The Hacker's Guide to Python in March 2014. A few hundred copies have been distributed so far, and the feedback is wonderful! I already wrote extensively about the making of that book last year, and I cannot emphasize enough how amazing this adventure has been so far. That's why I decided a few months ago to update the guide and add some new content. So let's talk about what's new in this second edition of the book! First, I obviously fixed a few things. I had some reports about small mistakes and typos, which I applied as I received them. Not a lot, fortunately, but it's still better to have fewer errors in a book, right? Then, I updated some of the content. Things have changed since I wrote the first chapters of the guide 18 months ago, so I had to rewrite some of the sections and take into account new software or libraries that were released. Lastly, I decided to enhance the book with one more interview. I asked my fellow OpenStack developer Joshua Harlow, who is leading a few interesting Python projects, to join the long list of interviewees in the book. I hope you'll enjoy it! If you didn't get the book yet, go check it out and use the coupon THGTP2LAUNCH to get 20% off during the next 48 hours!

21 April 2015

Julien Danjou: Gnocchi 1.0: storing metrics and resources at scale

A few months ago, I wrote a long post about what I called back then the "Gnocchi experiment". Time passed, and we (me and the rest of the Gnocchi team) continued to work on that project, finalizing it. It is with great pleasure that we are going to release our first 1.0 version this month, roughly at the same time that the integrated OpenStack projects release their Kilo milestone. The first release candidate, numbered 1.0.0rc1, has been released this morning!
The problem to solve
Before I dive into Gnocchi details, it's important to have a good view of the problems Gnocchi is trying to solve. Most of the IT infrastructures out there consist of a set of resources. These resources have properties: some of them are simple attributes, whereas others might be measurable quantities (also known as metrics). In this context, cloud infrastructures are no exception. We talk about instances, volumes, networks, which are all different kinds of resources. The problem arising with the cloud trend is the scalability of storing all this data and being able to request it later, for whatever usage. What Gnocchi provides is a REST API that allows the user to manipulate resources (CRUD) and their attributes, while preserving the history of those resources and their attributes. Gnocchi is fully documented and the documentation is available online. We are the first OpenStack project to require patches to integrate the documentation. We want to raise the bar, so we took a stand on that. That's part of our policy, the same way it's part of the OpenStack policy to require unit tests. I'm not going to paraphrase the whole Gnocchi documentation, which covers things like installation (super easy), but I'll guide you through some basics of the features provided by the REST API, and show you some examples so you can have a better understanding of what you could leverage using Gnocchi!
Handling metrics
Gnocchi provides a full REST API to manipulate time-series, which are called metrics. You can easily create a metric using a simple HTTP request:
POST /v1/metric HTTP/1.1
Content-Type: application/json

{
  "archive_policy_name": "low"
}

HTTP/1.1 201 Created
Location: http://localhost/v1/metric/387101dc-e4b1-4602-8f40-e7be9f0ed46a
Content-Type: application/json; charset=UTF-8

{
  "archive_policy": {
    "aggregation_methods": [
      "std",
      "sum",
      "mean",
      "count",
      "max",
      "median",
      "min",
      "95pct"
    ],
    "back_window": 0,
    "definition": [
      {
        "granularity": "0:00:01",
        "points": 3600,
        "timespan": "1:00:00"
      },
      {
        "granularity": "0:30:00",
        "points": 48,
        "timespan": "1 day, 0:00:00"
      }
    ],
    "name": "low"
  },
  "created_by_project_id": "e8afeeb3-4ae6-4888-96f8-2fae69d24c01",
  "created_by_user_id": "c10829c6-48e2-4d14-ac2b-bfba3b17216a",
  "id": "387101dc-e4b1-4602-8f40-e7be9f0ed46a",
  "name": null,
  "resource_id": null
}

The archive_policy_name parameter defines how the measures that are sent will be aggregated. You can also define archive policies using the API, specifying the aggregation periods and granularities you want. In this case, the low archive policy keeps 1 hour of data aggregated over 1 second, and 1 day of data aggregated over 30 minutes. The functions used for aggregation are mathematical functions such as standard deviation, minimum, maximum, and even the 95th percentile. All of that is obviously customizable, and you can create your own archive policies. If you don't want to specify the archive policy manually for each metric, you can also create archive policy rules that will apply a specific archive policy based on the metric name; e.g. metrics matching disk.* will be high-resolution metrics, so they will use the high archive policy. It's also worth noting that Gnocchi is precise up to the nanosecond and is not tied to the current time: you can manipulate and inject measures that are years old and precise to the nanosecond. You can also inject points with old timestamps (i.e. old compared to the most recent one in the timeseries) with an archive policy allowing it (see the back_window parameter). It's then possible to send measures to this metric:
POST /v1/metric/387101dc-e4b1-4602-8f40-e7be9f0ed46a/measures HTTP/1.1
Content-Type: application/json

[
  {
    "timestamp": "2014-10-06T14:33:57",
    "value": 43.1
  },
  {
    "timestamp": "2014-10-06T14:34:12",
    "value": 12
  },
  {
    "timestamp": "2014-10-06T14:34:20",
    "value": 2
  }
]

HTTP/1.1 204 No Content

These measures are synchronously aggregated and stored in the configured storage backend. Our most scalable storage drivers for now are based on either Swift or Ceph, which are both scalable object storage systems. It's then possible to retrieve these values:
GET /v1/metric/387101dc-e4b1-4602-8f40-e7be9f0ed46a/measures HTTP/1.1

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

[
[
"2014-10-06T14:30:00.000000Z",
1800.0,
19.033333333333335
],
[
"2014-10-06T14:33:57.000000Z",
1.0,
43.1
],
[
"2014-10-06T14:34:12.000000Z",
1.0,
12.0
],
[
"2014-10-06T14:34:20.000000Z",
1.0,
2.0
]
]

As older Ceilometer users might notice here, metrics only store points and values; nothing fancy such as metadata anymore. Each returned triple is a timestamp, the granularity of the aggregate in seconds (1800.0 being the 30-minute aggregate of the low policy, 1.0 the 1-second one), and the value. By default, values eagerly aggregated using mean are returned for all supported granularities. You can obviously specify a time range or a different aggregation function using the aggregation, start and stop query parameters. Gnocchi also supports doing aggregation across aggregated metrics:
GET /v1/aggregation/metric?metric=65071775-52a8-4d2e-abb3-1377c2fe5c55&metric=9ccdd0d6-f56a-4bba-93dc-154980b6e69a&start=2014-10-06T14:34&aggregation=mean HTTP/1.1

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

[
[
"2014-10-06T14:34:12.000000Z",
1.0,
12.25
],
[
"2014-10-06T14:34:20.000000Z",
1.0,
11.6
]
]

This computes the mean of means for the metrics 65071775-52a8-4d2e-abb3-1377c2fe5c55 and 9ccdd0d6-f56a-4bba-93dc-154980b6e69a, starting on 6th October 2014 at 14:34 UTC.
Indexing your resources
Another object and concept that Gnocchi provides is the ability to manipulate resources. There is a basic type of resource, called generic, which has very few attributes. You can extend this type to specialize it, and that's what Gnocchi does by default by providing resource types known to OpenStack such as instance, volume, network or even image.
POST /v1/resource/generic HTTP/1.1
Content-Type: application/json

{
  "id": "75C44741-CC60-4033-804E-2D3098C7D2E9",
  "project_id": "BD3A1E52-1C62-44CB-BF04-660BD88CD74D",
  "user_id": "BD3A1E52-1C62-44CB-BF04-660BD88CD74D"
}

HTTP/1.1 201 Created
Location: http://localhost/v1/resource/generic/75c44741-cc60-4033-804e-2d3098c7d2e9
ETag: "e3acd0681d73d85bfb8d180a7ecac75fce45a0dd"
Last-Modified: Fri, 17 Apr 2015 11:18:48 GMT
Content-Type: application/json; charset=UTF-8

{
  "created_by_project_id": "ec181da1-25dd-4a55-aa18-109b19e7df3a",
  "created_by_user_id": "4543aa2a-6ebf-4edd-9ee0-f81abe6bb742",
  "ended_at": null,
  "id": "75c44741-cc60-4033-804e-2d3098c7d2e9",
  "metrics": {},
  "project_id": "bd3a1e52-1c62-44cb-bf04-660bd88cd74d",
  "revision_end": null,
  "revision_start": "2015-04-17T11:18:48.696288Z",
  "started_at": "2015-04-17T11:18:48.696275Z",
  "type": "generic",
  "user_id": "bd3a1e52-1c62-44cb-bf04-660bd88cd74d"
}

The resource is created with the UUID provided by the user. Gnocchi handles the history of the resource, and that's what the revision_start and revision_end fields are for: they indicate the lifetime of this revision of the resource. The ETag and Last-Modified headers are also unique to this resource revision and can be used in subsequent requests with the If-Match or If-None-Match headers, for example:
GET /v1/resource/generic/75c44741-cc60-4033-804e-2d3098c7d2e9 HTTP/1.1
If-None-Match: "e3acd0681d73d85bfb8d180a7ecac75fce45a0dd"

HTTP/1.1 304 Not Modified

Which is useful to synchronize and update any view of the resources you might have in your application. You can use the PATCH HTTP method to modify properties of the resource, which will create a new revision of the resource. The history of the resources is obviously available via the REST API. The metrics property of the resource allows you to link metrics to a resource. You can link existing metrics or create new ones dynamically:
POST /v1/resource/generic HTTP/1.1
Content-Type: application/json

{
  "id": "AB68DA77-FA82-4E67-ABA9-270C5A98CBCB",
  "metrics": {
    "temperature": {
      "archive_policy_name": "low"
    }
  },
  "project_id": "BD3A1E52-1C62-44CB-BF04-660BD88CD74D",
  "user_id": "BD3A1E52-1C62-44CB-BF04-660BD88CD74D"
}

HTTP/1.1 201 Created
Location: http://localhost/v1/resource/generic/ab68da77-fa82-4e67-aba9-270c5a98cbcb
ETag: "9f64c8890989565514eb50c5517ff01816d12ff6"
Last-Modified: Fri, 17 Apr 2015 14:39:22 GMT
Content-Type: application/json; charset=UTF-8

{
  "created_by_project_id": "cfa2ebb5-bbf9-448f-8b65-2087fbecf6ad",
  "created_by_user_id": "6aadfc0a-da22-4e69-b614-4e1699d9e8eb",
  "ended_at": null,
  "id": "ab68da77-fa82-4e67-aba9-270c5a98cbcb",
  "metrics": {
    "temperature": "ad53cf29-6d23-48c5-87c1-f3bf5e8bb4a0"
  },
  "project_id": "bd3a1e52-1c62-44cb-bf04-660bd88cd74d",
  "revision_end": null,
  "revision_start": "2015-04-17T14:39:22.181615Z",
  "started_at": "2015-04-17T14:39:22.181601Z",
  "type": "generic",
  "user_id": "bd3a1e52-1c62-44cb-bf04-660bd88cd74d"
}

Haystack, needle? Find!
With such a system, it becomes very easy to index all your resources, meter them and retrieve this data. What's even more interesting is to query the system to find and list the resources you are interested in! You can search for a resource based on any field, for example:
POST /v1/search/resource/instance HTTP/1.1
Content-Type: application/json

{
  "=": {
    "user_id": "bd3a1e52-1c62-44cb-bf04-660bd88cd74d"
  }
}

That query will return a list of all resources owned by the user_id bd3a1e52-1c62-44cb-bf04-660bd88cd74d. You can do fancier queries such as retrieving all the instances started by a user this month:
POST /v1/search/resource/instance HTTP/1.1
Content-Type: application/json
Content-Length: 113

{
  "and": [
    {
      "=": {
        "user_id": "bd3a1e52-1c62-44cb-bf04-660bd88cd74d"
      }
    },
    {
      ">=": {
        "started_at": "2015-04-01"
      }
    }
  ]
}

And you can do even fancier queries than the fancy ones (still following?). What if we wanted to retrieve all the instances that were on host foobar on the 15th of April and that already had at least an hour of uptime? Let's ask Gnocchi to look in the history!
POST /v1/search/resource/instance?history=true HTTP/1.1
Content-Type: application/json
Content-Length: 113

{
  "and": [
    {
      "=": {
        "host": "foobar"
      }
    },
    {
      ">=": {
        "lifespan": "1 hour"
      }
    },
    {
      "<=": {
        "revision_start": "2015-04-15"
      }
    }
  ]
}

I could also mention the fact that you can search for values in metrics. One feature that I will very likely include in Gnocchi 1.1 is the ability to search for resources whose specific metrics match some value, for example the ability to search for instances whose CPU consumption was over 80% during a month.
Cherries on the cake
While Gnocchi is well integrated with and based on common OpenStack technology, do note that it is completely able to function without any other OpenStack component and is pretty straightforward to deploy. Gnocchi also implements a full RBAC system based on the OpenStack standard oslo.policy, which allows pretty fine-grained control of permissions.
There is also some work ongoing to have HTML rendering when browsing the API using a Web browser. While still simple, we'd like to have a minimal Web interface served on top of the API for the same price! The Ceilometer alarm subsystem supports Gnocchi with the Kilo release, meaning you can use it to trigger actions when a metric value crosses some threshold. And OpenStack Heat also supports auto-scaling your instances based on Ceilometer+Gnocchi alarms. And there are a few more API calls that I didn't talk about here, so don't hesitate to take a peek at the full documentation!
Towards Gnocchi 1.1!
Gnocchi is a different beast in the OpenStack community. It is under the umbrella of the Ceilometer program, but it's one of the first projects that is not part of the (old) integrated release. Therefore we decided to have a release schedule not directly linked to OpenStack's, and we'll release more often than the rest of the old OpenStack components, probably once every 2 months or so. What's coming next is a closer integration with Ceilometer (e.g. moving the dispatcher code from Gnocchi to Ceilometer) and probably more features as we get more requests from our users. We are also exploring different backends such as InfluxDB (storage) or MongoDB (indexer). Stay tuned, and happy hacking!

16 February 2015

Julien Danjou: Hacking Python AST: checking methods declaration

A few months ago, I wrote the definitive guide about Python method declaration, which was quite a success. I still fight every day in OpenStack to have developers declare their methods correctly in the patches they submit.
Automation plan
The thing is, I really dislike doing the same things over and over again. Furthermore, I'm not perfect either, and I miss a lot of this kind of problem in the reviews I make. So I decided to replace myself with a program: a more scalable and less error-prone version of my brain. In OpenStack, we rely on flake8 to do static analysis of our Python code in order to spot common programming mistakes. But we are really pedantic, so we wrote some extra hacking rules that we enforce on our code. To that end, we wrote a flake8 extension called hacking. I really like these rules; I even recommend applying them in your own projects. Though I might be biased, or a victim of Stockholm syndrome. Your call. Anyway, it's pretty clear that I needed to add a check for method declaration in hacking. Let's write a flake8 extension!
Typical error
The typical error I spot is the following:
class Foo(object):
    # self is not used, the method does not need
    # to be bound, it should be declared static
    def bar(self, a, b, c):
        return a + b - c

That would be the correct version:
class Foo(object):
    @staticmethod
    def bar(a, b, c):
        return a + b - c

This kind of mistake is not a show-stopper. It's just not optimized. Why you have to manually declare static or class methods might be a language issue, but I don't want to debate Python misfeatures or design flaws here.
Strategy
We could probably use some big magical regular expression to catch this problem. flake8 is based on the pep8 tool, which can do a line-by-line analysis of the code. But that method would make it very hard and error-prone to detect this pattern. However, it's also possible to do an AST-based analysis on a per-file basis with pep8. So that's the method I picked, as it's the most solid.
AST analysis
I won't dive deeply into Python AST and how it works. You can find plenty of sources on the Internet, and I even talk about it a bit in my book The Hacker's Guide to Python. To check correctly whether all the methods in a Python file are correctly declared, we need to: iterate over all the statements of the module, skip everything that is not a class definition, iterate over the methods of each class, skip the ones already declared static, and check whether the remaining ones use their first argument.
Flake8 plugin
In order to register a new plugin in flake8 via hacking, we just need to add an entry in setup.cfg:
[entry_points]
flake8.extension =
    H904 = hacking.checks.other:StaticmethodChecker
    H905 = hacking.checks.other:StaticmethodChecker

We register 2 hacking codes here. As you will notice later, we are actually going to add an extra check to our code for the same price. Stay tuned. The next step is to write the actual plugin. Since we are using an AST-based check, the plugin needs to be a class following a certain signature:
@core.flake8ext
class StaticmethodChecker(object):
    def __init__(self, tree, filename):
        self.tree = tree

    def run(self):
        pass

So far, so good and pretty easy. We store the tree locally, then we just need to use it in run() and yield the problems we discover, following the pep8 expected signature, which is a tuple of (lineno, col_offset, error_string, code).
This AST is made for walking
The ast module provides the walk function, which allows one to iterate easily over a tree. We'll use that to run through the AST. First, let's write a loop that ignores the statements that are not class definitions.
@core.flake8ext
class StaticmethodChecker(object):
    def __init__(self, tree, filename):
        self.tree = tree

    def run(self):
        for stmt in ast.walk(self.tree):
            # Ignore non-class
            if not isinstance(stmt, ast.ClassDef):
                continue

We still don't check for anything, but we now know how to ignore statements that are not class definitions. The next step is to ignore what is not a function definition. We just iterate over the attributes of the class definition.
for stmt in ast.walk(self.tree):
    # Ignore non-class
    if not isinstance(stmt, ast.ClassDef):
        continue
    # If it's a class, iterate over its body members to find methods
    for body_item in stmt.body:
        # Not a method, skip
        if not isinstance(body_item, ast.FunctionDef):
            continue

We're all set for checking the method, which is body_item. First, we need to check whether it's already declared as static. If so, we don't have to do any further checks and we can bail out.
for stmt in ast.walk(self.tree):
    # Ignore non-class
    if not isinstance(stmt, ast.ClassDef):
        continue
    # If it's a class, iterate over its body members to find methods
    for body_item in stmt.body:
        # Not a method, skip
        if not isinstance(body_item, ast.FunctionDef):
            continue
        # Check that it has a decorator
        for decorator in body_item.decorator_list:
            if (isinstance(decorator, ast.Name)
                    and decorator.id == 'staticmethod'):
                # It's a static function, it's OK
                break
        else:
            # Function is not static, we do nothing for now
            pass

Note that we use the special for/else form of Python, where the else is evaluated unless we used break to exit the for loop.
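For readers unfamiliar with for/else, here is a tiny standalone sketch of the construct:

for i in (0, 1, 2):
    if i == 4:
        break
else:
    # No break occurred, so this branch runs
    print("4 was never found")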
for stmt in ast.walk(self.tree):
    # Ignore non-class
    if not isinstance(stmt, ast.ClassDef):
        continue
    # If it's a class, iterate over its body members to find methods
    for body_item in stmt.body:
        # Not a method, skip
        if not isinstance(body_item, ast.FunctionDef):
            continue
        # Check that it has a decorator
        for decorator in body_item.decorator_list:
            if (isinstance(decorator, ast.Name)
                    and decorator.id == 'staticmethod'):
                # It's a static function, it's OK
                break
        else:
            try:
                first_arg = body_item.args.args[0]
            except IndexError:
                yield (
                    body_item.lineno,
                    body_item.col_offset,
                    "H905: method misses first argument",
                    "H905",
                )
                # Check next method
                continue

We finally added a check! We grab the first argument from the method signature. If that fails, we know there's a problem: you can't have a bound method without the self argument, therefore we raise the H905 code to signal a method that is missing its first argument. Now you know why we registered this second pep8 code along with H904 in setup.cfg: we had a good opportunity here to kill two birds with one stone. The next step is to check whether that first argument is used in the code of the method.
for stmt in ast.walk(self.tree):
    # Ignore non-class
    if not isinstance(stmt, ast.ClassDef):
        continue
    # If it's a class, iterate over its body members to find methods
    for body_item in stmt.body:
        # Not a method, skip
        if not isinstance(body_item, ast.FunctionDef):
            continue
        # Check that it has a decorator
        for decorator in body_item.decorator_list:
            if (isinstance(decorator, ast.Name)
                    and decorator.id == 'staticmethod'):
                # It's a static function, it's OK
                break
        else:
            try:
                first_arg = body_item.args.args[0]
            except IndexError:
                yield (
                    body_item.lineno,
                    body_item.col_offset,
                    "H905: method misses first argument",
                    "H905",
                )
                # Check next method
                continue
            for func_stmt in ast.walk(body_item):
                if six.PY3:
                    if (isinstance(func_stmt, ast.Name)
                            and first_arg.arg == func_stmt.id):
                        # The first argument is used, it's OK
                        break
                else:
                    if (func_stmt != first_arg
                            and isinstance(func_stmt, ast.Name)
                            and func_stmt.id == first_arg.id):
                        # The first argument is used, it's OK
                        break
            else:
                yield (
                    body_item.lineno,
                    body_item.col_offset,
                    "H904: method should be declared static",
                    "H904",
                )

To that end, we iterate using ast.walk again, and we look for the use of the same variable name (usually self, but it could be anything, like cls for @classmethod) in the body of the function. If it is not found, we finally yield the H904 error code. Otherwise, we're good.
Conclusion
I've submitted this patch to hacking, and, fingers crossed, it might be merged one day. If it's not, I'll create a new Python package with that check for flake8. The actual submitted code is a bit more complex, to take into account the use of the abc module, and includes some tests. As you may have noticed, the code walks over the module AST definition several times. There might be a couple of optimizations to browse the AST in only one pass, but I'm not sure it's worth it considering the actual usage of the tool. I'll leave that as an exercise for the reader interested in contributing to OpenStack. Happy hacking!
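To see both codes in action, here is a small hypothetical sample file; the comments indicate what the checker above would report for each method:

class Example(object):
    def no_args():
        # H905: method misses first argument
        return 42

    def unused_self(self):
        # H904: first argument never used, should be declared static
        return 42

    @staticmethod
    def already_static(a, b):
        # OK: explicitly decorated
        return a + b

    def uses_self(self):
        # OK: self is used in the body
        return self.already_static(1, 2)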
The Hacker's Guide to Python
A book I wrote talking about designing Python applications, state of the art, advice to apply when building your application, various Python tips, etc. Interested? Check it out.

21 November 2014

Julien Danjou: Distributed group management and locking in Python with tooz

With OpenStack embracing the Tooz library more and more over the past year, I think it's a good time to write a bit about it.
A bit of history
A little more than a year ago, with my colleague Yassine Lamgarchal and others at eNovance, we investigated how to solve a problem often encountered inside OpenStack: the synchronization of multiple distributed workers. And while many people in our ecosystem continue to drive development by adding new bells and whistles, we made a point of solving new problems with a generic solution able to address the technical debt at the same time. Yassine wrote up the first ideas of what should be the group membership service that OpenStack needed, identifying several projects that could make use of it. I presented this concept during the OpenStack Summit in Hong-Kong during an Oslo session. It turned out that the idea was well received, and the week following the summit we started the tooz project on StackForge.
Goals
Tooz is a Python library that provides a coordination API. Its primary goal is to handle groups and the membership of these groups in distributed systems. Tooz also provides another useful feature: distributed locking. This allows distributed nodes to acquire and release locks in order to synchronize themselves (for example to access a shared resource).
The architecture
If you are familiar with distributed systems, you might be thinking that there are a lot of solutions already available to solve these issues: ZooKeeper, the Raft consensus algorithm or even Redis, for example. You'll be thrilled to learn that Tooz is not the result of the NIH syndrome, but is an abstraction layer on top of all these solutions. It uses drivers to provide the real functionality behind the API, and does not try to do anything fancy. Not all the drivers have the same amount of functionality or robustness, but depending on your environment, any available driver might suffice. Like most of OpenStack, we let the deployers/operators/developers choose whichever backend they want to use, informing them of the potential trade-offs. So far, Tooz provides several drivers, including ones based on ZooKeeper, memcached, redis and IPC. All drivers work across processes. Some can be distributed across the network (ZooKeeper, memcached, redis) and some are only available on the same host (IPC). Also note that the Tooz API is completely asynchronous, allowing it to be more efficient, and potentially included in an event loop.
Features
Group membership
Tooz provides an API to manage group membership. The basic operations provided are: the creation of a group, the ability to join it, leave it and list its members. It's also possible to be notified as soon as a member joins or leaves a group.
Leader election
Each group can have a leader elected. Each member can decide whether it wants to run for the election. If the leader disappears, another one is elected from the list of current candidates. It's possible to be notified of the election result and to retrieve the leader of a group at any moment.
Distributed locking
When trying to synchronize several workers in a distributed environment, you may need a way to lock access to some resources. That's what a distributed lock can help you with.
Adoption in OpenStack
Ceilometer is the first project in OpenStack to use Tooz. It has replaced part of the old alarm distribution system, where RPC was used to detect active alarm evaluator workers. The group membership feature of Tooz was leveraged by Ceilometer to coordinate between alarm evaluator workers.
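A minimal sketch of the membership and locking API described above (the driver URL, group and member names are illustrative; in real code, creating a group that already exists raises an error and would need guarding):

from tooz import coordination

# Any supported driver URL works here, e.g. a memcached or ZooKeeper one
coordinator = coordination.get_coordinator('memcached://localhost:11211',
                                           b'worker-1')
coordinator.start()

# Group membership: create, join, list (calls return async results)
coordinator.create_group(b'workers').get()
coordinator.join_group(b'workers').get()
members = coordinator.get_members(b'workers').get()

# Distributed locking
lock = coordinator.get_lock(b'shared-resource')
with lock:
    pass  # Access the shared resource here

coordinator.leave_group(b'workers').get()
coordinator.stop()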
Another new feature, part of the Juno release of Ceilometer, is the distribution of the polling tasks of the central agent among multiple workers. There's again a group membership issue in knowing which nodes are online and available to receive polling tasks, so Tooz is also being used here. The Oslo team has accepted the adoption of Tooz during this release cycle. That means that it will be maintained by more developers, and will be part of the OpenStack release process. This opens the door to pushing Tooz further into OpenStack. Our next candidate would be to write a service group driver for Nova. The complete documentation for Tooz is available online and has examples for the various features described here; go read it if you're curious and adventurous!

2 November 2014

Thomas Goirand: OpenStack packaging activity: October 2014

Wednesday 1:
Uploaded python-xstatic-jquery removing the .pth file from package.
Uploaded python-taskflow 0.4 to experimental, needed by Cinder Juno RC1
Uploaded Cinder Juno RC1 to experimental
Thursday 2:
Finally understood that the issue with murano-dashboard was that it doesn't build without django-nose >= 1.2. Opened new patch at: https://review.openstack.org/125565
Uploaded murano-dashboard to Experimental, now using django-nose from wheezy-backports in my jenkins setup, so murano-dashboard can be built for Wheezy.
Uploaded python-oslotest 1.1.0.0 (really is upstream 1.1.0)
Uploaded python-oslo.serialization 1.0.0-1 (needed by Ceilometer Juno RC1)
Uploaded Ceilometer Juno RC1
Uploaded Heat Juno RC1
Uploaded oslo.rootwrap 1.3.0.0
Uploaded oslo.db 1.0.2 (bugfix release)
Wrote a new system in openstack-pkg-tools to generate init scripts and .service files from a template, so we don't have to write N times the same thing.
Friday 3:
Reworked openstack-pkg-tools to generate automatically sysv-rc init scripts, upstart jobs and systemd unit files, making the system more unified and consistent.
Applied the new system to all packages in Juno.
Uploaded Keystone 2014.1.3-1 to Sid
Uploaded Nova 2014.1.3-1 to Sid
Uploaded Glance 2014.1.3-1 to Sid
Uploaded Neutron 2014.1.3-1 to Sid
Uploaded Horizon 2014.1.3-1 to Sid
Uploaded Cinder 2014.1.3-1 to Sid
Uploaded Trove 2014.1.3-1 to Sid
Uploaded Ceilometer 2014.1.3-1 to Sid
Saturday 4:
Uploaded Horizon Juno RC1 to Experimental
Uploaded oslotest 1.1.0.0 to Experimental
Uploaded Ironic Juno RC1 to Experimental
Uploaded Designate Juno RC1 to Experimental
Uploaded Nova Juno RC1 to Experimental
Uploaded Neutron Juno RC1 to Experimental
Uploaded openstack-meta-packages 0.10 to Sid
Uploaded openstack-pkg-tools 13 to Experimental
Uploaded murano-agent Juno RC1 to Experimental
Sunday 5:
Uploaded Sahara Juno RC1 to Experimental (it's been approved by FTP masters)
Uploaded Murano Juno RC1 to Experimental (it's been approved by FTP masters)
Fixed all debian/watch files to understand ~b and ~rc releases (fix applied on both Icehouse and Juno branches, though no upload yet; I'll wait until uploads are needed to have this in the archive).
Uploaded Trove Juno RC1 to Experimental
Uploaded Sahara Juno RC1 to Experimental. With this last upload, everything of Juno RC1 is in Debian Experimental! \o/
Monday 6:
Uploaded some fixes for Nova 2014.1.3-2 in Sid:
* Removed contrib/boto_v6/* in debian/copyright, replaced bin/nova-manage by nova/cmd/{baremetal_,}manage.py.
* Mangling upstream rc and beta versions in watch file.
* Added 9990_update_german_programm_messages.patch, thanks to Helge Kreutzmann <debian@helgefjell.de>.
* Fixed correct de.po (Closes: #763682).
* Added nl.po initial Debconf translation, thanks to Frans Spiesschaert <Frans.Spiesschaert@yucom.be> (Closes: #764125).
* Standards-Version is now 3.9.6 (no change).
Upstreamed german translation of po file: https://review.openstack.org/126212
Uploaded Designate 2014.1-12 to Sid, added new de.po also to the Juno branch on alioth (but didn't upload the fix yet).
Uploaded sphinxcontrib-httpdomain new upstream 1.3.0 release, added Python 3.x support to the package, and transitioning to the correct namespaced python-sphinxcontrib.httpdomain package name.
Spent most of the day fixing python-xstatic issues:
o uploaded libjs-twitter-bootstrap-datepicker 1.3.1
o uploaded python-xstatic-bootstrap-datepicker requiring this libjs package
o fixed python-xstatic-jquery-ui package
Now Horizon Juno RC1 builds well, and can be installed again. \o/
Tuesday 7:
Backported python-libvirt 1.2.8 in Wheezy (for Nova Juno support)
Uploaded Ceilometer Juno RC1 with ceilometer-agent-ipmi added (the package will therefore go through the NEW queue).
Uploaded python-requestbuilder 0.2.2-1, needed by the maintainers of euca2ools.
Ported the unified generated init system scripts to Icehouse packages.
Uploaded to Sid updates for: openstack-pkg-tools, ceilometer, cinder, glance, keystone, nova.
Wednesday 8:
Uploaded openstack-pkg-tools 16 to Sid
Uploaded murano-dashboard (with upstream fix to remove font-awesome, which was the reason for the FTP master's rejection)
Uploaded ceilometer Juno RC1 with new IPMI agent package (needed for Ironic support).
Uploaded heat 2014.1.3 which I forgot.
Tested https://review.openstack.org/#/c/126777/ which solves the bug I sent to launchpad and approved the patch.
Uploaded python-requestbuilder 0.2.3
Thursday 9:
Worked on fixing Neutron Alembic migration with SQLite3.
Uploaded Neutron 2014.2~rc1-3 with a fix for a patch that was destroying dhcp.py. This still doesn't include the Alembic migration fixes, which are still a WIP.
Friday 10:
Finished fixing Neutron SQLite 3 Alembic migrations.
Uploaded neutron 2014.2~rc1-3 with the fixes.
Fixed Ceilometer wrong generation of sample config file, using upstream patch (after discussing with Julien Danjou so he wrote it).
Uploaded Ceilometer 2014.2~rc1-4 with the fix
Checked that all packages can be installed in non-interactive mode. This works well now! \o/
Saturday 11:
Uploaded new version of python-xstatic-angular-cookies (ie: 1.2.24.1-2) which allows a higher version of libjs-angularjs (otherwise the package is not installable in Sid/Jessie since the last version of angularjs was uploaded).
Sunday 12:
Uploaded factory-boy fix for FTBFS
Uploaded python-django-appconf FTBFS
Uploaded Horizon Juno RC2
Uploaded Heat Juno RC3
Uploaded Trove Juno RC2
Uploaded Glance Juno RC2
Uploaded Sahara Juno RC2
Uploaded Nova Juno RC2
Uploaded Neutron Juno RC2
Uploaded Cinder Juno RC2
Uploaded murano-dashboard Juno RC2
Monday 13:
Uploaded python-heatclient 0.2.12-1 to Experimental
Uploaded python-yaql with RC bugfix to Sid (missing dep on python3-ply).
Tuesday 14:
Fixed arping newly added dependency in Neutron
Started testing install of all of OpenStack Juno at once
Wednesday 15:
Fixed missing configuration files in Ceilometer (ceilometer-api couldn't start)
Upgraded to Ceilometer Juno RC3.
Backported python-setuptools, as keystone and others are broken due to the namespace of modules not working correctly with the old version of python-pkg-resources. With the new one, everything is back in order.
Thursday 16:
Uploaded to Debian Experimental the final release of Juno (ie: 2014.2) for:
Sahara
Nova
Ceilometer
Cinder
Heat
Neutron
Glance
Keystone
Horizon (with fix for Django 1.7 in the wsgi file)
Uploaded to Sid:
Swift 2.2.0
Horizon 2014.1.3-3 with fix for Django 1.7 in the wsgi file that was crashing apache. OpenStack Juno packages are out!!! (ready the day of the upstream release)
Friday 17:
Investigated Trove RC bug #765348, couldn't reproduce, and therefore closed it.
Uploaded Ironic Juno final to Experimental
Uploaded Designate Juno final to Experimental
Uploaded a fix for python-jingo which failed to build with Django 1.7. Sent pull request upstream: https://github.com/jbalogh/jingo/pull/63
Uploaded CVE-2014-7230 & CVE-2014-7231 fixes for both Cinder and Nova in Debian Sid, as per the OSSA 2014-036 patches. No need to upload a fix for Trove, as 2014.1.3 already has the fixes.
Saturday 18:
Started building Trusty packages
Fixed oslo-config so that it never depends on python3-argparse, which doesn't exist (uploaded to Experimental)
Uploaded python-django-pyscss 1.0.3-2 with python-simplejson now as build-depends (it failed to build in my Trusty jenkins without it).
Uploaded a fix for stevedore and oslo-config to not depend on python3-argparse in Ubuntu (added debian/py3dist-overrides)
Sunday 19:
Uploaded python-taskflow with ordereddict in debian/pydist-overrides.
Backported JS packages for Horizon and libvirt for Trusty (from Sid). My new Jenkins server is now producing a full set of Juno packages for Ubuntu Trusty. And of course, it's updated on each git push, just like for the Wheezy backports.
Monday 20:
Added FORCE_COULEUR=1 when running tests in python-couleur, so that it doesn't fail when running with git-buildpackage. Uploaded result in Sid.
Fixed python-mockito so that it never downloads distribute or nose on its clean target, which was annoying when running git-buildpackage. Uploaded to Sid.
Started to work again on automatic package deployment using openstack-deploy, from the openstack-meta-packages source package.
Tuesday 21, Wednesday 22:
Worked on testing packages, did a couple of minor fixes, reworked some of the default configuration files to match the install-guide, moved configuration directives to the correct new section in nova.conf, etc.
Thursday 23:
Patched the Neutron chapter in the install-guide to take into account the changes done on Tuesday 21 and Wednesday 22, and simplify the install procedure in Debian. https://review.openstack.org/#/c/130501/
Friday 24:
Busy packing my stuff for moving to France. Not much packaging work, except more auto-deploy stuff and some tests.
Saturday 25:
Uploaded Nova, Neutron, Cinder and Horizon Icehouse in Sid, including some debconf translation updates, beating the Jessie freeze deadline by 10 days.
Fixed and uploaded openstack-debian-images in Sid: the login option wasn't modifying the default sudoers file, which always contained "debian" instead of the custom login.
Sunday 26:
Traveled to Moscow
Monday 27 & Tuesday 28:
Fixed some murano & murano-dashboard stuff, thanks to the help of some murano team members in Moscow office. Uploaded fixes for murano & murano-dashboard. Tested that murano-dashboard works well, and now it does! :)
Uploaded version dependency fixes for python-xstatic-angular-cookies and python-xstatic-d3 which couldn't be installed in Sid/Jessie because of libjs-* updates.
Wednesday 29:
Meeting with Saratov team
Updated sahara endpoints, but didn't upload the package yet to Debian.
Thursday 30:
Uploaded ruby-raemon needed for Astute (part of Fuel web).
Packaged ruby-symboltable (not uploaded yet).
Friday 31:
Wrote a unit test runner for python-webpy (the current package doesn't have unit test runs).
Uploaded python-dbutils (needed by python-web.py unit tests) to Sid: now in NEW queue
Uploaded python-nose-parametrized & python-nose-timer to Sid: now in NEW queue
Uploaded sahara -2 fixing the API endpoint registration URL and service name.
Uploaded python-sphinxcontrib.plantuml to Sid: now in NEW queue

15 September 2014

Julien Danjou: Python bad practice, a concrete case

A lot of people read up on good Python practice, and there's plenty of information about that on the Internet. Many tips are included in the book I wrote this year, The Hacker's Guide to Python. Today I'd like to show a concrete case of code that I don't consider to be state of the art. In my last article where I talked about my new project Gnocchi, I wrote about how I tested, hacked and then ditched whisper. Here I'm going to explain part of my thought process and a few things that raised my eyebrows when hacking this code. Before I start, please don't get the spirit of this article wrong. It's in no way a personal attack on the authors and contributors (who I don't know). Furthermore, whisper is a piece of code that is in production in thousands of installations, storing metrics for years. While I can argue that I consider the code not to be following best practice, it definitely works well enough and is valuable to a lot of people.

Tests The first thing that I noticed when trying to hack on whisper is the lack of tests. There's only one file containing tests, named test_whisper.py, and the coverage it provides is pretty low. One can check that using the coverage tool.
$ coverage run test_whisper.py
...........
----------------------------------------------------------------------
Ran 11 tests in 0.014s

OK
$ coverage report
Name           Stmts   Miss  Cover
----------------------------------
test_whisper     134      4    97%
whisper          584    227    61%
----------------------------------
TOTAL            718    231    67%

While one would think that 61% is "not so bad", taking a quick peek at the actual test code shows that the tests are incomplete. What I mean by incomplete is that they, for example, use the library to store values into a database, but they never check if the results can be fetched and if the fetched results are accurate. Here's a good reason one should never blindly trust the test coverage percentage as a quality metric. When I tried to modify whisper, as the tests do not check the entire cycle of the values fed into the database, I ended up making wrong changes but still had the tests pass.

No PEP 8, no Python 3 The code doesn't respect PEP 8. A run of flake8 + hacking shows 732 errors. While this does not impact the code itself, it's more painful to hack on it than it is on most Python projects. The hacking tool also shows that the code is not Python 3 ready, as there is usage of Python 2 only syntax. A good way to fix that would be to set up tox and add a few targets for PEP 8 checks and Python 3 tests, as sketched below. Even if the test suite is not complete, starting by having flake8 run without errors and the few unit tests working with Python 3 should put the project in a better light.
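Such a setup could be as small as this tox.ini sketch (the environment names and deps are my assumptions, not taken from the whisper repository):

[tox]
envlist = py27,py34,pep8

[testenv]
# Run the existing unit tests with each Python version in envlist.
commands = {envpython} test_whisper.py

[testenv:pep8]
# hacking pulls in flake8 plus extra style checks.
deps = hacking
commands = flake8 whisper.py test_whisper.py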
Not using idiomatic Python A lot of the code could be simplified by using idiomatic Python. Let's take a simple example:
def fetch(path,fromTime,untilTime=None,now=None):
  fh = None
  try:
    fh = open(path,'rb')
    return file_fetch(fh, fromTime, untilTime, now)
  finally:
    if fh:
      fh.close()

That piece of code could be easily rewritten as:
def fetch(path,fromTime,untilTime=None,now=None):
  with open(path, 'rb') as fh:
    return file_fetch(fh, fromTime, untilTime, now)

This way, the function actually looks so simple that one can even wonder why it should exist; but why not. Usage of loops could also be made more Pythonic:
for i,archive in enumerate(archiveList):
  if i == len(archiveList) - 1:
    break

could be actually:
for i, archive in enumerate(itertools.islice(archiveList, len(archiveList) - 1)):

That reduces the code size and makes it easier to read through the code.

Wrong abstraction level Also, one thing that I noticed in whisper is that it abstracts its features at the wrong level. Take the create() function, it's pretty obvious:
def create(path,archiveList,xFilesFactor=None,aggregationMethod=None,sparse=False,useFallocate=False):
  # Set default params
  if xFilesFactor is None:
    xFilesFactor = 0.5
  if aggregationMethod is None:
    aggregationMethod = 'average'

  #Validate archive configurations...
  validateArchiveList(archiveList)

  #Looks good, now we create the file and write the header
  if os.path.exists(path):
    raise InvalidConfiguration("File %s already exists!" % path)
  fh = None
  try:
    fh = open(path,'wb')
    if LOCK:
      fcntl.flock( fh.fileno(), fcntl.LOCK_EX )

    aggregationType = struct.pack( longFormat, aggregationMethodToType.get(aggregationMethod, 1) )
    oldest = max([secondsPerPoint * points for secondsPerPoint,points in archiveList])
    maxRetention = struct.pack( longFormat, oldest )
    xFilesFactor = struct.pack( floatFormat, float(xFilesFactor) )
    archiveCount = struct.pack(longFormat, len(archiveList))
    packedMetadata = aggregationType + maxRetention + xFilesFactor + archiveCount
    fh.write(packedMetadata)
    headerSize = metadataSize + (archiveInfoSize * len(archiveList))
    archiveOffsetPointer = headerSize

    for secondsPerPoint,points in archiveList:
      archiveInfo = struct.pack(archiveInfoFormat, archiveOffsetPointer, secondsPerPoint, points)
      fh.write(archiveInfo)
      archiveOffsetPointer += (points * pointSize)

    #If configured to use fallocate and capable of fallocate use that, else
    #attempt sparse if configure or zero pre-allocate if sparse isn't configured.
    if CAN_FALLOCATE and useFallocate:
      remaining = archiveOffsetPointer - headerSize
      fallocate(fh, headerSize, remaining)
    elif sparse:
      fh.seek(archiveOffsetPointer - 1)
      fh.write('\x00')
    else:
      remaining = archiveOffsetPointer - headerSize
      chunksize = 16384
      zeroes = '\x00' * chunksize
      while remaining > chunksize:
        fh.write(zeroes)
        remaining -= chunksize
      fh.write(zeroes[:remaining])

    if AUTOFLUSH:
      fh.flush()
      os.fsync(fh.fileno())
  finally:
    if fh:
      fh.close()

The function is doing everything: checking if the file doesn't already exist, opening it, building the structured data, writing this, building more structure, then writing that, etc. That means that the caller has to give a file path, even if it just wants a whisper data structure to store itself elsewhere. StringIO() could be used to fake a file handler, but it will fail if the call to fcntl.flock() is not disabled, and it is inefficient anyway. There are a lot of other functions in the code, such as setAggregationMethod(), that mix file handling (even doing things like os.fsync()) with the manipulation of structured data. This is definitely not a good design, especially for a library, as it turns out that reusing the functions in a different context is near impossible.

Race conditions There are race conditions, for example in create() (see added comment):
if os.path.exists(path):
  raise InvalidConfiguration("File %s already exists!" % path)
fh = None
try:
  # TOO LATE I ALREADY CREATED THE FILE IN ANOTHER PROCESS YOU ARE GOING TO
  # FAIL WITHOUT GIVING ANY USEFUL INFORMATION TO THE CALLER :-(
  fh = open(path,'wb')

That code should be:
try:
  fh = os.fdopen(os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL), 'wb')
except OSError as e:
  if e.errno == errno.EEXIST:
    raise InvalidConfiguration("File %s already exists!" % path)
  raise

to avoid any race condition.

Unwanted optimization We saw earlier the fetch() function that is barely useful, so let's take a look at the file_fetch() function that it calls.
def file_fetch(fh, fromTime, untilTime, now = None):
  header = __readHeader(fh)
  [...]

The first thing the function does is to read the header from the file handler. Let's take a look at that function:
def __readHeader(fh):
  info = __headerCache.get(fh.name)
  if info:
    return info

  originalOffset = fh.tell()
  fh.seek(0)
  packedMetadata = fh.read(metadataSize)

  try:
    (aggregationType,maxRetention,xff,archiveCount) = struct.unpack(metadataFormat,packedMetadata)
  except:
    raise CorruptWhisperFile("Unable to read header", fh.name)
  [...]

The first thing the function does is to look into a cache. Why is there a cache? It actually caches the header with an index based on the file path (fh.name). Except that if one, for example, decides not to use a file and cheats using StringIO, then it does not have any name attribute. So this code path will raise an AttributeError. One has to set a fake name manually on the StringIO instance, and it must be unique so nobody messes with the cache:
import StringIO

packedMetadata = <some source>
fh = StringIO.StringIO(packedMetadata)
fh.name = "myfakename"
header = __readHeader(fh)

The cache may actually be useful when accessing files, but it's definitely useless when not using files. And it's not necessarily true that the complexity (even if small) that the cache adds is worth it. I doubt most whisper-based tools are long-running processes, so the cache that is really used when accessing the files is the one handled by the operating system kernel, and this one is going to be much more efficient anyway, and shared between processes. There's also no expiry of that cache, which could end up with tons of memory used and wasted.

Docstrings None of the docstrings are written in a parsable syntax like Sphinx. This means you cannot generate any documentation in a nice format that a developer using the library could read easily. The documentation is also not up to date:
def fetch(path,fromTime,untilTime=None,now=None):
  """fetch(path,fromTime,untilTime=None)
  [...]
  """

def create(path,archiveList,xFilesFactor=None,aggregationMethod=None,sparse=False,useFallocate=False):
  """create(path,archiveList,xFilesFactor=0.5,aggregationMethod='average')
  [...]
  """

This is something that could be avoided if a proper format was picked to write the docstrings. A tool could then be used to get notified when there's a divergence between the actual function signature and the documented one, like a missing argument.
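For instance, here is what a Sphinx-style docstring for the create() function seen earlier could look like (a sketch; the parameter descriptions are my own reading of the code above):

def create(path, archiveList, xFilesFactor=None, aggregationMethod=None,
           sparse=False, useFallocate=False):
    """Create a new whisper database file.

    :param path: path of the database file to create
    :param archiveList: list of (secondsPerPoint, points) tuples
    :param xFilesFactor: fraction of data points in a propagation interval
                         that must be known to aggregate, defaults to 0.5
    :param aggregationMethod: aggregation function name, defaults to 'average'
    :param sparse: create a sparse file instead of writing zeroes
    :param useFallocate: pre-allocate the file using fallocate() if available
    :raises InvalidConfiguration: if a file already exists at path
    """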
Duplicated code Last but not least, there's a lot of code that is duplicated around in the scripts provided by whisper in its bin directory. These scripts should be very lightweight and be using the console_scripts facility of setuptools, but they actually contain a lot of (untested) code. Furthermore, some of that code is partially duplicated from the whisper.py library, which is against DRY.

Conclusion There are a few more things that made me stop considering whisper, but these are part of the whisper features, not necessarily code quality. One can also point out that the code is very condensed and hard to read, and that's a more general problem about how it is organized and abstracted. A lot of these defects are actually points that made me start writing The Hacker's Guide to Python a year ago. Running into this kind of code makes me think it was a really good idea to write a book on advice to write better Python code!

The Hacker's Guide to Python
A book I wrote talking about designing Python applications, state of the art, advice to apply when building your application, various Python tips, etc. Interested? Check it out.

19 August 2014

Julien Danjou: Tracking OpenStack contributions in GitHub

I've switched my Git repositories to GitHub recently, and started to watch my contribution statistics, which were very low considering I spend my days hacking on open source software, especially OpenStack.
OpenStack hosts its Git repositories on its own infrastructure at git.openstack.org, but also mirrors them on GitHub. Logically, I was expecting GitHub to track my commits there too, as I'm using the same email address everywhere. It turns out that this was not the case, and the help page about that on GitHub describes the rules in place to compute statistics. Indeed, according to GitHub, I had no relation to the OpenStack repositories, as I never forked them nor opened a pull request on them (OpenStack uses Gerrit). Starring a repository is enough to build a relationship between a user and a repository, so this was the only thing needed to inform GitHub that I have contributed to those repositories. Considering OpenStack has hundreds of repositories, I decided to star them all by using a small Python script using pygithub.
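The script could be as simple as this sketch (the API token is obviously a placeholder, and starring is idempotent, so re-running it is harmless):

from github import Github

g = Github("my-api-token")  # hypothetical personal access token
me = g.get_user()

# Star every repository of the OpenStack organization mirror on GitHub.
for repo in g.get_organization("openstack").get_repos():
    me.add_to_starred(repo)
    print("Starred", repo.full_name)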
And voilà, my statistics now include all my contributions to OpenStack!

18 August 2014

Julien Danjou: OpenStack Ceilometer and the Gnocchi experiment

A little more than 2 years ago, the Ceilometer project was launched inside the OpenStack ecosystem. Its main objective was to measure OpenStack cloud platforms in order to provide data and mechanisms for functionalities such as billing, alarming or capacity planning. In this article, I would like to relate what I've been doing with other Ceilometer developers in the last 5 months. I've lowered my direct involvement in Ceilometer itself to concentrate on solving one of its biggest issues at the source, and I think it's largely time to take a break and talk about it.

Ceilometer early design For the last years, Ceilometer didn't change in its core architecture. Without diving too much into all its parts, one of the early design decisions was to build the metering around a data structure we called samples. A sample is generated each time Ceilometer measures something. It is composed of a few fields, such as the resource id that is metered, the user and project id owning that resource, the meter name, the measured value, a timestamp and a few free-form metadata. Each time Ceilometer measures something, one of its components (an agent, a pollster...) constructs and emits a sample headed for the storage component that we call the collector. This collector is responsible for storing the samples into a database. The Ceilometer collector uses a pluggable storage system, meaning that you can pick any database system you prefer. Our original implementation has been based on MongoDB from the beginning, but we then added a SQL driver, and people contributed things such as HBase or DB2 support. The REST API exposed by Ceilometer allows the execution of various read requests on this data store. It can return the list of resources that have been measured for a particular project, or compute statistics on the metrics. Allowing such a large panel of possibilities and having such a flexible data structure makes it possible to do a lot of different things with Ceilometer, as you can query the data in almost any way you want.
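To give an idea, a sample roughly carries the following fields (a sketch with made-up values; the real object is a richer class, not a plain dict):

sample = {
    "resource_id": "7efe8ab6-...",        # the resource being metered
    "user_id": "3f2e91...",               # owner of the resource
    "project_id": "91aa07...",
    "counter_name": "cpu_util",           # the meter name
    "counter_volume": 12.5,               # the measured value
    "timestamp": "2014-08-18T10:00:05",
    "resource_metadata": {"flavor": "m1.small"},  # free-form metadata
}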
The scalability issue We soon started to encounter scalability issues in many of the read requests made via the REST API. A lot of the requests require the data storage to do full scans of all the stored samples. Indeed, the fact that the API allows you to filter on any field and also on the free-form metadata (meaning non-indexed key/value tuples) has a terrible cost in terms of performance (as pointed out before, the metadata are attached to each sample generated by Ceilometer and stored as is). That basically means that the sample data structure is stored in most drivers in just one table or collection, in order to be able to scan them at once, and there's no good "perfect" sharding solution, making data storage scalability painful. It turns out that the Ceilometer REST API is unable to handle most of the requests in a timely manner, as most operations are O(n) where n is the number of samples recorded (see big O notation if you're unfamiliar with it). That number of samples can grow very rapidly in an environment of thousands of metered nodes and with a data retention of several weeks. There are a few optimizations to make things smoother in general cases, fortunately, but as soon as you run specific queries, the API becomes barely usable. During this last year, as the Ceilometer PTL, I discovered these issues first hand since a lot of people were sending me this kind of testimony. We engaged several blueprints to improve the situation, but it was soon clear to me that this was not going to be enough anyway.
Thinking outside the box Unfortunately, the PTL job doesn't leave you enough time to work on the actual code nor to play with anything new. I was coping with most of the project bureaucracy and I wasn't able to work on any good solution to tackle the issue at its root. Still, I had a few ideas that I wanted to try, and as soon as I stepped down from the PTL role, I stopped working on Ceilometer itself to try something new and to think a bit outside the box. When one takes a look at what has been brought recently into Ceilometer, they can see the idea that Ceilometer actually needs to handle 2 types of data: events and metrics. Events are data generated when something happens: an instance starts, a volume is attached, or an HTTP request is sent to a REST API server. These are events that Ceilometer needs to collect and store. Most OpenStack components are able to send such events using the notification system built into oslo.messaging. Metrics are what Ceilometer needs to store but that is not necessarily tied to an event. Think about an instance's CPU usage, a router's network bandwidth usage, the number of images that Glance is storing for you, etc. These are not events, since nothing is happening. These are facts, states we need to meter. Computing statistics for billing or capacity planning requires both of these data sources, but they should be distinct. Based on that assumption, and the fact that Ceilometer was getting support for storing events, I started to focus on getting the metric part right. I had been a system administrator for a decade before jumping into OpenStack development, so I know a thing or two about how monitoring is done in this area, and what kind of technology operators rely on. I also know that there's still no silver bullet; this made it a good challenge. The first thing that came to my mind was to use some kind of time-series database, and export its access via a REST API as we do in all OpenStack services. This should cover the metric storage pretty well.

Cooking Gnocchi
A cloud of gnocchis!
At the end of April 2014, this led me to start a new project code-named Gnocchi. For the record, the name was picked after having confused the OpenStack Marconi project name so many times, reading OpenStack Macaroni instead. At least one OpenStack project should have a "pasta" name, right? The point of having a new project and not sending patches to Ceilometer was that, first, I had no clue if it was going to produce anything better, and second, I wanted to be able to iterate more rapidly without being strongly coupled with the release process. The first prototype started around the following idea: what you want is to meter things. That means storing a list of tuples of (timestamp, value) for each of them. I've named these things "entities", as no assumptions are made on what they are. An entity can represent the temperature in a room or the CPU usage of an instance. The service shouldn't care and should be agnostic in this regard. One feature that we discussed over several OpenStack summits in the Ceilometer sessions was the idea of doing aggregation. Meaning, aggregating samples over a period of time to only store a smaller amount of them. These are things that time-series formats such as RRDtool's have been doing on the fly for a long time, and I decided it was a good trail to follow. I assumed that this was going to be a requirement when storing metrics into Gnocchi. The user would need to state what kind of archiving they need: 1 second precision over a day, 1 hour precision over a year, or even both. The first driver written to achieve that and store those metrics inside Gnocchi was based on whisper. Whisper is the file format used to store metrics for the Graphite project. For the actual storage, the driver uses Swift, which has the advantage of being part of OpenStack and scalable. Storing the metrics for each entity in a different whisper file and putting them in Swift turned out to have a fantastic algorithmic complexity: it was O(1). Indeed, the complexity needed to store and retrieve metrics doesn't depend on the number of metrics you have nor on the number of things you are metering. Which is already a huge win compared to the current Ceilometer collector design. However, it turned out that whisper has a few limitations that I was unable to circumvent in any manner. I needed to patch it to remove a lot of its assumptions about manipulating files, or that everything is relative to now (time.time()). I started to hack on that in my own fork, but then everything broke. The whisper project code base is, well, not the state of the art, and has 0 unit tests. I was staring at a huge effort to transform whisper into the time-series format I wanted, without being sure I wasn't going to break everything (remember, no test coverage). I decided to take a break and look into alternatives, and stumbled upon Pandas, a data manipulation and statistics library for Python. It turns out that Pandas supports time series natively, and that it could do a lot of the smart computation needed in Gnocchi. I built a new file format leveraging Pandas for computing the time series and named it carbonara (a wink to both the Carbon project and pasta, how clever!). The code is quite small (a third of whisper's, 200 SLOC vs 600 SLOC), does not have many of the whisper limitations, and it has test coverage. These Carbonara files are then, in the same fashion, stored into Swift containers.
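This is not Carbonara's actual on-disk format, but a small sketch of the kind of time-series aggregation that Pandas gives you for free (timestamps and values are made up):

import pandas as pd

# A tiny series of (timestamp, value) measurements, the shape of a
# Gnocchi entity as described above.
measures = pd.Series(
    [42.0, 43.5, 41.0, 40.2],
    index=pd.to_datetime([
        "2014-08-18 10:00:05", "2014-08-18 10:00:35",
        "2014-08-18 10:01:15", "2014-08-18 10:01:45",
    ]),
)

# Downsample to one aggregated point per minute, the way an archive
# policy such as "1 minute precision over a day" would require.
print(measures.resample("1min").mean())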
Anyway, the Gnocchi storage driver system is designed in the same spirit as the rest of the OpenStack and Ceilometer storage driver systems. It's a plug-in system with an API, so anyone can write their own driver. Eoghan Glynn has already started to write an InfluxDB driver, working closely with the upstream developer of that database. Dina Belova started to write an OpenTSDB driver. This helps to make sure the API is designed in the right way.

Handling resources Measuring individual entities is great and needed, but you also need to link them with resources. When measuring the temperature and the number of people in a room, it is useful to link these 2 separate entities to a resource, in that case the room, and give a name to these relations, so one is able to identify what attribute of the resource is actually measured. It is also important to provide the possibility to store attributes on these resources, such as their owners, the time they started and ended their existence, etc.
Relationship of entities and resources
Once this list of resources is collected, the next step is to list and filter them, based on any criteria. One might want to retrieve the list of resources created last week or the list of instances hosted on a particular node right now. Resources also need to be specialized. Some resources have attributes that must be stored in order for filtering to be useful. Think about an instance name or a router network. All of these requirements led to the design of what's called the indexer. The indexer is responsible for indexing entities and resources, and linking them together. The initial implementation is based on SQLAlchemy and should be pretty efficient. It's easy enough to index the most requested attributes (columns), and they are also correctly typed. We plan to establish a model for all known OpenStack resources (instances, volumes, networks...) to store and index them into the Gnocchi indexer in order to request them in an efficient way from one place. The generic resource class can be used to handle generic resources that are not tied to OpenStack. It'd be up to the users to store extra attributes. Dropping the free-form metadata we used to have in Ceilometer makes sure that querying the indexer is going to be efficient and scalable.
The indexer classes and their relations
REST API All of this is exported via a REST API that was partially designed and documented in the Gnocchi specification in the Ceilometer repository; though the spec is not up-to-date yet. We plan to auto-generate the documentation from the code as we are currently doing in Ceilometer. The REST API is pretty easy to use, and you can use it to manipulate entities and resources, and request the information back.
Macroscopic view of the Gnocchi architecture
Roadmap & Ceilometer integration All of this plan was exposed and discussed with the Ceilometer team during the last OpenStack summit in Atlanta in May 2014, for the Juno release. I led a session about this entire concept, and convinced the team that using Gnocchi for our metric storage would be a good approach to solve the Ceilometer collector scalability issue. It was decided to conduct this project experiment in parallel with the current Ceilometer collector for the time being, and see where that would lead the project.

Early benchmarks Some engineers from Mirantis did a few benchmarks around Ceilometer and also against an early version of Gnocchi, and Dina Belova presented them to us during the mid-cycle sprint we organized in Paris in early July. The following graph sums up the current Ceilometer performance issue pretty well. The more you feed it with metrics, the slower it becomes.
For Gnocchi, while the numbers themselves are not fantastic, what is interesting is that all the graphs below show that the performance is stable, without correlation with the number of resources, entities or measures. This proves that, indeed, most of the code is built around a complexity of O(1), and not O(n) anymore.
Next steps
Clément drawing the logo
While the Juno cycle is being wrapped up for most projects, including Ceilometer, Gnocchi development is still ongoing. Fortunately, the composite architecture of Ceilometer allows a lot of its features to be replaced by some other code dynamically. That, for example, enables Gnocchi to provide a Ceilometer dispatcher plugin for its collector, without having to ship the actual code in Ceilometer itself. That should help the development of Gnocchi not be slowed down by the release process for now. The Ceilometer team aims to provide Gnocchi as a sort of technology preview with the Juno release, allowing it to be deployed along with and plugged into Ceilometer. We'll discuss how to integrate it into the project in a more permanent and strong manner, probably during the OpenStack Summit for Kilo that will take place next November in Paris.

7 June 2014

Vasudev Kamath: Exposing function in python module using entry_points, WSME in a Flask webapp

The heading might be ambiguous, but I couldn't figure out a better one, so let me start by explaining what I'm trying to solve here.
Problem I have a Python module which contains a function which I want to expose as a REST web service in a Flask application. I use WSME in the Flask application, which needs the signature of the function in question, and the problem arises because the function to be exposed is foreign to the Flask application: it resides in a separate Python module.
Solution While reading Julien Danjou's Hacker's Guide To Python book I came across the setuptools entry_points concept, which can be used to extend the existing features of a tool, like plug-ins. So here I'm going to use this entry_points feature from setuptools to provide a function in the module which can expose the signature of the function[s] to be exposed through REST. Of course this means I need to modify the module in question to write the entry_points and the function giving out the signature of the function to be exposed. I will explain this with a small example. I have a dummy module which provides an add function and a function which exposes the add function's signature.
def add(a, b):
    return a + b
def expose_rest_func():
    return [add, int, int, int]
This is stored in the dummy/__init__.py file. I use the pbr tool to package my Python module. Below is the content of the setup.cfg file.
[metadata]
name = dummy
author = Vasudev Kamath
author-email = kamathvasudev@gmail.com
summary = Dummy module for testing purpose
version = 0.1
license = MIT
description-file =
  README.rst
requires-python = >= 2.7
[files]
packages =
  dummy
[entry_points]
myapp.api.rest =
  rest = dummy:expose_rest_func
The special thing in the above file is the entry_points section, which defines the function to be hooked into the entry_point. In our case the entry_point myapp.api.rest is used by our Flask application to interact with the modules which expose it. The function obtained by accessing the entry_point is expose_rest_func, which gives out the function to be exposed, its argument types and its return type as a list. If we were only supporting Python 3, it would have been sufficient to know just the function name and use function annotations in the function definition, as the sketch below shows. Since I want to support both Python 2 and Python 3, this is out of the question.
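For the curious, this is what the Python 3-only variant with annotations could look like (a hypothetical sketch; the annotations carry exactly the types that expose_rest_func returns as a list):

def add(a: int, b: int) -> int:
    # The types can be read back from add.__annotations__,
    # so no separate expose_rest_func() would be needed.
    return a + b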
Now, just run the following commands in a virtualenv to get the module installed.
PBR_VERSION=0.1 python setup.py sdist
pip install dist/dummy_module-0.1.tar.gz
Now if you want to see whether the module is exposing the entry_point or not, just use the entry_point_inspector tool; after installing it you will get a command called epi. If you run it as follows, you should note the myapp.api.rest group (exposed by our dummy module) in its output.
epi group list
+-----------------------------+
| Name                        |
+-----------------------------+
| cliff.formatter.completion  |
| cliff.formatter.list        |
| cliff.formatter.show        |
| console_scripts             |
| distutils.commands          |
| distutils.setup_keywords    |
| egg_info.writers            |
| epi.commands                |
| flake8.extension            |
| setuptools.file_finders     |
| setuptools.installation     |
| myapp.api.rest              |
| stevedore.example.formatter |
| stevedore.test.extension    |
| wsme.protocols              |
+-----------------------------+
So our entry_point is now exposed; we need to access it in our Flask application and expose the function using WSME. This is done by the code below.
from wsmeext.flask import signature
import flask
import pkg_resources
def main():
    app = flask.Flask(__name__)
    app.config['DEBUG'] = True
    for entrypoint in pkg_resources.iter_entry_points('myapp.api.rest'):
        # Ugly, but the fix would mean supporting only Python 3
        func_signature = entrypoint.load()()
        app.route('/' + func_signature[0].__name__, methods=['POST'])(
            signature(func_signature[-1],
                      *func_signature[1:-1])(func_signature[0]))
    app.run()
if __name__ == '__main__':
    main()
The entry_points in the myapp.api.rest group are iterated using the pkg_resources package provided by setuptools; when I load the entry_point I get back the function to be used, which is called in the same place to get the function signature. Then I'm calling the Flask and WSME decorator functions (yeah, instead of decorating I'm using them directly over the function to be exposed). The code looks a bit ugly at the place where I'm accessing the list using slices, but I can't help it due to the limitations of Python 2. With Python 3 there is new packing and unpacking stuff which makes the code look a bit cooler, see below.
from wsmeext.flask import signature
import flask
import pkg_resources
def main():
    app = flask.Flask(__name__)
    app.config['DEBUG'] = True
    for entrypoint in pkg_resources.iter_entry_points('myapp.api.rest'):
        func, *args, rettype = entrypoint.load()()
        app.route('/' + func.__name__, methods=['POST'])(
            signature(rettype, *args)(func))
    app.run()
if __name__ == '__main__':
    main()
You can access the service at http://localhost:5000/add; depending on the Accept header of the HTTP request you will get either an XML or a JSON response. If you access it from a browser, you will get an XML response.
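For example, the service can be exercised with the requests library (a sketch; it assumes requests is installed and the Flask app above is running locally):

import requests

resp = requests.post(
    "http://localhost:5000/add",
    data={"a": 1, "b": 2},                   # arguments passed via POST
    headers={"Accept": "application/json"},  # ask WSME for JSON
)
print(resp.json())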
Usecase Now if you are wondering what the reason behind this experiment is, it is for the SILPA Project. I'm trying to implement a REST service for all the Indic language computing modules. Since all these modules are independent of SILPA, which is a Flask web app, I had to find a way to achieve this, and this is what I came up with.
Conclusion I'm not sure if there are other approaches to achieve this; if there are, I would love to hear about them. You can send your comments and suggestions over email.

3 June 2014

Vasudev Kamath: Using WSME with Flask microframework

After reading Julien Danjou's book I found out about WSME (Web Service Made Easy), a Python framework which allows us to easily create web services in Python. For SILPA we needed a REST-like interface and I thought of giving it a try as WSME readily advertised Flask integration, and this post was born when I read the documentation for that integration. First of all, Flask is a nice framework which will right away allow the development of a REST API for simple purposes, but my requirement was a bit more complicated: I had to expose functions from separate Python modules through SILPA. I think the detailed requirement can be part of another post, so let me explain how to use WSME with a Flask app. WSME integration with Flask is done via the decorator function wsmeext.flask.signature, which expects you to provide it with the signature of the function to expose. Per its documentation, the signature of the signature function is:
wsmeext.flask.signature(return_type, *arg_types, **options)
Yeah, that's all the docs have, sadly. So basically exposing is the only thing WSME handles for us here; routing and other stuff needs to be done by Flask itself. So let's consider an example, a simple function to add, as shown below.
def add(a, b):
    return a + b
For providing a REST-like service, all you need is the code below.
from flask import Flask
from wsmeext.flask import signature
app = Flask(__name__)
@app.route('/add')
@signature(int, int, int)
def add(a, b):
    return a + b
if __name__ == '__main__':
    app.run()
So the first argument to signature is the return type of the function, and the rest of the arguments are the types of the arguments of the function to be exposed. Now you can access the newly exposed service by visiting http://localhost:5000/add, but don't forget to pass the arguments either via the query string or through POST. You can restrict the access methods via Flask's route. So what's the big deal about not having docs, right? Well, the fun part began when I used a slightly more complex return type like dictionaries or lists. Below is the modified code I'm using to demonstrate the problem I faced when using dict as the return type.
from flask import Flask
from wsmeext.flask import signature
app = Flask(__name__)
@app.route('/add')
@signature(dict, int, int)
def add(a, b):
    return {"result": a + b}
if __name__ == '__main__':
    app.run()
Basically I'm returning a dictionary containing the result now, for demonstration purposes. When I ran the application, boom, Python barked at me with the following message.
Traceback (most recent call last):
File "wsme_dummy.py", line 7, in <module>
 @signature(dict, int, int)
File "c:\Users\invakam2\.virtualenvs\wsmetest\lib\site-packages\wsmeext\flask.py", line 48, in decorator
 funcdef.resolve_types(wsme.types.registry)
File "c:\Users\invakam2\.virtualenvs\wsmetest\lib\site-packages\wsme\api.py", line 109, in resolve_types
 self.return_type = registry.resolve_type(self.return_type)
File "c:\Users\invakam2\.virtualenvs\wsmetest\lib\site-packages\wsme\types.py", line 739, in resolve_type
 type_ = self.register(type_)
File "c:\Users\invakam2\.virtualenvs\wsmetest\lib\site-packages\wsme\types.py", line 668, in register
 class_._wsme_attributes = None
TypeError: can't set attributes of built-in/extension type 'dict'
After going through the code of the files involved in the above trace, this is what I found:
  1. wsmeext.flask.signature in turn uses wsme.signature, which is just an alias of wsme.api.signature.
  2. The link in the documentation in the sentence "See @signature for parameter documentation" is broken; it should actually link to wsme.signature in the docs.
  3. wsme.signature actually calls resolve_type to check the types of the return value and arguments. This function checks if the types are instances of dict or list; in such cases it creates instances of wsme.types.DictType or wsme.types.ArrayType respectively, with the value types from the argument.
  4. When I just passed the built-in type dict, control went to the else part, which just passed the type to the wsme.types.Registry.register function; that tries to set the attribute _wsme_attributes, which raises a TypeError, as we can't set attributes on built-in types.
So by inspecting the code of wsme.types.Registry.resolve_type and wsme.types.Registry.register, it's clear that what signature expects, when an argument or return type is a dictionary/list, is an instance of a dictionary/list with the type of the values in it. Maybe that sentence is a bit vague, but I'm not sure how to put it more clearly. As an example, in our case the add function returns a dictionary with a string key and an int value, so the return type argument for signature will be {str: int}. Similarly, if you return an array with int values it will be [int]. With the above understanding our add function now looks like below.
@signature({str: int}, int, int)
def add(a, b):
    return {'result': a + b}
and now the code worked just fine! What I couldn't figure out here is that there is no way to have a tuple as a return value or argument, but I guess that is not a big deal. So the immediate task for me after finding this is to fix the link in the documentation to point to wsme.signature, and probably put a note somewhere in the documentation about the above finding.

30 May 2014

Julien Danjou: OpenStack Design Summit Juno, from a Ceilometer point of view

Last week was the OpenStack Design Summit in Atlanta, GA where we, developers, discussed and designed the upcoming OpenStack release (Juno). I was there mainly to discuss Ceilometer's upcoming developments. The summit has been great. It was my third OpenStack design summit, and the first one where I was not a PTL, meaning it was a largely more relaxed summit for me! On Monday, we started with a 2.5-hour meeting with Ceilometer core developers and contributors about the Gnocchi experimental project that I started a few weeks ago. It was a great and productive afternoon, and allowed me to introduce and cover this topic extensively, something that would not have been possible in the allocated session we had later in the week. Ceilometer had its design sessions running mainly during Wednesday. We noted a lot of things and commented during the sessions in our Etherpad instances. Here is a short summary of the sessions I attended.

Scaling the central agent I was in charge of the first session, and introduced the work that was done so far on scaling the central agent. Six months ago, during the Havana summit, I proposed to scale the central agent by distributing the tasks among several nodes, using a library to handle the group membership aspect of it. That led to the creation of the tooz library that we worked on at eNovance during the last 6 months. Now that we have this foundation available, Cyril Roelandt started to replace the Ceilometer alarming job repartition code with Taskflow and Tooz. Starting with the alarming subsystem is simpler, and it will be a first proof of concept to be reused by the central agent afterwards. We plan to get this merged for Juno. For the central agent, the same work needs to be done, but since it's a bit more complicated, it will be done after the alarming evaluators are converted.

Test strategy The next session discussed the test strategy and how we could improve Ceilometer unit and functional testing. There is a lot to be done in this area, and this is going to be one of the main focuses of the team in the upcoming weeks. Having Tempest tests run was a goal for Havana, and even if we made a lot of progress, we're still not there yet.

Complex queries and per-user/project data collection This session, led by Ildikó Váncsa, was about adding finer-grained configuration to the pipeline configuration to allow per-user and per-project data retrieval. This was not really controversial, though how to implement it exactly is still to be discussed, but the idea was well received. The other part of the session was about adding more to the complex queries feature provided by the v2 API.

Rethinking Ceilometer as a Time-Series-as-a-Service This was my main session, the reason we met on Monday for a few hours, and, I hope, one of the most promising sessions of the week. It appears that the way Ceilometer designed its API and storage backends a long time ago is now a problem for scaling the data storage. Also, the events API we introduced in the last release partially overlaps some of the functionality provided by the samples API, which causes us scaling troubles. Therefore, I started to rethink the Ceilometer API by building it as a time series read/write service, leaving the audit part of our previous sample API to the event subsystem. After some research and experimentation, I designed a new project called Gnocchi, which provides exactly that functionality in a hopefully scalable way.
Gnocchi is split in two parts: a time series API and its driver, and a resource indexing API with its own driver. Having two distinct driver sets allows it to use different technologies to store each data type in the best storage engine possible. The canonical driver for time series handling is based on Pandas and Swift. The canonical resource indexer driver is based on SQLAlchemy. The idea and the project were well received and looked pretty exciting to most people. Our hope is to design a version 3 of the Ceilometer API around Gnocchi at some point during the Juno cycle, and have it ready as some sort of preview for the final release.

Revisiting the Ceilometer data model This session, led by Alexei Kornienko, kind of echoed the previous session, as it clearly also tried to address the Ceilometer scalability issue, but in a different way. Anyway, the SQL driver limitations have been discussed, and Mehdi Abaakouk implemented some of the suggestions during the week, so we should very soon see better performance in Ceilometer with the current default storage driver.

Ceilometer devops session We organized this session to get feedback from the devops community about deploying Ceilometer. It was very interesting, and the list of things we could improve is long; I think it will help us to drive our future efforts.

SNMP inspectors This session, led by Lianhao Lu, discussed various details of the future of SNMP support in Ceilometer.

Alarm and logs improvements This mixed session, led by Nejc Saje and Gordon Chung, was about possible improvements to the alarm evaluation system provided by Ceilometer, and about making logging in Ceilometer more effective. Both half-sessions were interesting and led to several ideas on how to improve both systems.

Conclusion Considering the current QA problems with Ceilometer, Eoghan Glynn, the new Project Technical Leader for Ceilometer, clearly indicated that this will be the main focus of the release cycle. Personally, I will be focused on working on Gnocchi, and will likely be joined by others in the next weeks. Our idea is to develop a complete solution with a high velocity in the next weeks, and then work on its integration with Ceilometer itself.

7 May 2014

Julien Danjou: Making of The Hacker's Guide to Python

As promised, today I would like to write a bit about the making of The Hacker's Guide to Python. It has been a very interesting experiment, and I think it is worth sharing with you.

The inspiration It all started out at the beginning of August 2013. I was spending my summer, like the rest of the year, hacking on OpenStack. As the years passed, I got more and more deeply involved in the various tools that we either built or contributed to within the OpenStack community. And I somehow got the feeling that my experience with Python, the way we used it inside OpenStack and other applications during these last years, was worth sharing. Worth writing something bigger than a few blog posts. The OpenStack project does code reviews, and therefore so did I for almost two years. That inspired a lot of topics, like the definitive guide to method decorators that I wrote at the time I started the hacker's guide. Stumbling upon the same mistakes or misunderstandings over and over is, somehow, inspiring. I also stumbled upon Nathan Barry's blog and his book Authority, which were very helpful to get started and provided some sort of guideline. All of that brought me enough ideas to start writing a book about Python software development for people already familiar with the language.

The writing The first thing I started to do was to list all the topics I wanted to write about. The list turned out to have subjects that had no direct interest for a practical guide. For example, on one hand, very few developers know in detail how metaclasses work, but on the other hand, I never had to write a metaclass during these last years. That's the kind of subject I decided not to write about, so I dropped all subjects that I felt were not going to help my reader be more productive. Even if they could be technically interesting.
Then, I gathered all the problems I saw during the code reviews I did over these last two years. Some of them I only recalled in the days following the beginning of the project. But I kept adding them to the table of contents, reorganizing stuff as needed. After a couple of weeks, I had a pretty good overview of the contents I would write about. All I had to do was to fill in the blanks (that sounds so simple now). The entire writing of the book took around a hundred hours spread from August to November, during my spare time. I had to stop all my other side projects for that.

The interviews While writing the book, I tried to parallelize everything I could. That included asking people for interviews to be included in the book. I already had a pretty good list of the people I wanted to feature in the book, so I took some time as soon as possible to ask them, and sent them detailed questions. I discovered two categories of interviewees. Some of them were very fast to answer (less than a week), and others were much, much slower. A couple of them even set up Git repositories to answer the questions, because that probably looked like an entire project to them. :-) So I had to not lose sight of them and kindly ask from time to time if everything was alright, and at some point I started to kindly set some deadlines. In the end, the quality of the answers was awesome, and I like to think that was because I picked the right people!

The proof-reading Once the book was finished, I somehow needed to have people proof-read it. This was probably the hardest part of this experiment. I needed two different types of reviews: technical reviews, to check that the content was correct and interesting, and language reviews. The latter is even more important since English is not my native language. Finding technical reviewers seemed easy at first, as I had a ton of contacts that I identified as being able to review the book. I started by asking a few people if they would be comfortable reading a simple chapter and giving me feedback. I started to do that in September: having the writing and the reviews done in parallel was important to me in order to minimize latency and the book's release delay. All the people I contacted answered positively that they would be interested in doing a technical review of a chapter. So I started to send chapters to them. But in the end, only 20% replied back. And even after that, a large portion stopped reviewing after a couple of chapters. Don't get me wrong: you can't be mad at people not wanting to spend their spare time on book editing like you do. However, from the few people that gave their time to review a few chapters, I got tremendous feedback, at all levels. That's something that was very important and that helped a lot in getting confident. Writing a book alone for months without having anyone looking over your shoulder can make you doubt that you are creating something worth it. As far as English proof-reading goes, I went ahead and used oDesk to recruit a professional proof-reader. I looked for people with the right skills: a good English level (being a native English speaker at least), being able to understand what the book was about, and being able to work within reasonable delays. I had mixed results from the people I hired, but I guess that's normal. The only error I made was not parallelizing those reviews enough, so I probably lost a couple of months on that.

The toolchain
While writing the book, I took a few breaks to build a toolchain. What I call a toolchain is the set of tools used to render the final PDF, EPUB and MOBI files of the guide. After some research, I decided to settle on AsciiDoc, using its DocBook output, which is then transformed into LaTeX and then into PDF, or directly into EPUB; I rely on Calibre to convert the EPUB file to MOBI (a sketch of such a pipeline is included at the end of this section). It took me a few hours to do what I wanted, using some magic LaTeX tricks to get a proper rendering, but it was worth it and I'm particularly happy with the result. For the cover design, I asked my talented friend Nicolas to do something for me, and he designed the wonderful cover and its little snake! The publishing Publishing is an interesting topic people kept asking me about. This is what I had to answer a few dozen times: I never had any plan to ask an editor to publish this book. Nowadays, asking an editor to publish a book feels to me like asking a major label to publish a CD. It feels awkward. However, don't get me wrong: there can be a few upsides to having an editor. They will find reviewers and review your book for you. Having the book reviews handled for you is probably a very good thing, considering how hard it was for me to get that in place; it can be especially important for a technical book. Also, your book may end up in brick-and-mortar stores and be part of a collection, both of which improve visibility. That may improve your book's sales, though the editor and all the intermediaries are going to keep the largest share of the money anyway. I've heard good stories about people using Gumroad to sell electronic content, so after looking at the competitors in that market, I picked them. I also had the idea of selling the book for Bitcoins, so I settled on Coinbase, because they have a nice API for doing that. Setting everything up was quite straightforward, especially with Gumroad; it only took me a few hours. Writing the Coinbase application took a few hours too. My initial plan was to sell only an electronic version online, but since I kept hearing that a printed version should exist, I decided to give it a try. I chose to work with Lulu because I knew people using it, and it was pretty simple to set up. The launch Once I had everything ready, I built the selling page and connected everything between Mailchimp, Gumroad, Coinbase, Google Analytics, etc. Writing the launch email was really exciting. I used a Mailchimp feature to send the launch mail in several batches, just to have some margin in case of a sudden last-minute problem. But everything went fine. Hurrah! I distributed around 200 copies of the ebook in the first 48 hours, for about $5000. That covered all the costs I had from writing the book, and then some, so I was already pretty happy with the launch.
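For the curious, here is a minimal sketch of what such a rendering pipeline can look like, driven from Python. Everything in it is illustrative: the file names and tool options are assumptions, not the exact ones I used, and it assumes asciidoc, dblatex, dbtoepub and Calibre's ebook-convert are installed.

# Minimal sketch of an AsciiDoc -> DocBook -> PDF/EPUB/MOBI pipeline.
# File names and flags are illustrative assumptions; requires asciidoc,
# dblatex, dbtoepub and Calibre's ebook-convert on the PATH.
import subprocess

SOURCE = "book.txt"   # hypothetical AsciiDoc source file
DOCBOOK = "book.xml"

# AsciiDoc source to DocBook XML
subprocess.check_call(["asciidoc", "-b", "docbook", "-o", DOCBOOK, SOURCE])

# DocBook to PDF: dblatex goes through LaTeX internally
subprocess.check_call(["dblatex", "-o", "book.pdf", DOCBOOK])

# DocBook directly to EPUB
subprocess.check_call(["dbtoepub", "-o", "book.epub", DOCBOOK])

# EPUB to MOBI with Calibre
subprocess.check_call(["ebook-convert", "book.epub", "book.mobi"])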
Retrospective In retrospect, what I probably didn't do as well as I could have was building a solid mailing list of interested people, and building strong anticipation and an incentive to buy the book on launch day. My mailing list counted around 1500 people who subscribed because they were interested in the launch of the book; in the end, probably only 10-15% of them bought the book during the launch, which is probably a bit lower than what I could have expected. But more than a month later, I have distributed in total almost 500 copies of the book (including physical units) for more than $10000, so I tend to think that this was a success. I still sell a few copies of the book each week, but the numbers are small compared to the launch. I sold fewer than 10 copies of the ebook using Bitcoins, and I admit I'm a bit disappointed and surprised by that. Physical copies represent 10% of the book distribution. That is probably a lot lower than most of the people who pushed me to do it thought it would be, but it is still higher than what I thought it would be. So I would still advise having a paperback version of your book, if only because it's nice to have it in your library.
I have only received positive feedback, a few typo notices, and absolutely no refund requests, which I find really amazing. The good news is also that I've been contacted by a couple of Korean and Chinese publishers to get the book translated and published in those countries. If everything goes well, the book should be translated in the upcoming months and be available in those markets in 2015! If you didn't get a copy yet, there's still time to do so!

6 April 2014

Julien Danjou: Doing A/B testing with Apache httpd

When I started to write the landing page for The Hacker's Guide to Python, I wanted to try new things at the same time. I had read about A/B testing a while ago, and I figured this was a good opportunity to test it out. A/B testing If you do not know what A/B testing is about, take a quick look at the Wikipedia page on the subject. Long story short, the idea is to serve two different versions of a page to your visitors and check which one has the most success. Once you have found which version is better, you can switch to it for good. In the case of my book, I used that technique on the pre-launch page where people were able to subscribe to the newsletter. I didn't have a lot of things I wanted to test on that page, so I just used that approach on the subtitle, which was either "Learn everything you need to build a successful Python project" or "It's time you make the most out of Python". Statistically, each version would be served half of the time, so both would get the same number of views. I would then build statistics about which page attracted the most subscribers, and with the results I would be able to switch definitively to that version of the landing page. Technical design My Web site, this Web site, is entirely static and served by Apache httpd. I didn't want to use any dynamic page or language for this, mainly because I didn't want to have something else to install and maintain on my server just for that. It turns out that Apache httpd is powerful enough to implement such a feature. There are different ways to build it, and I'm going to describe my choices here. The first thing to pick is a way to balance the display of the pages: you need a mechanism so that out of 100 visitors, around 50 will see version A of your page and around 50 will see version B. You could pick a random number for each visitor and decide which page they are going to see, but at first sight I didn't find a way to do that with Apache httpd. My second thought was to use the client IP address, but that is not such a good idea: if your visitors include, for example, people behind a company firewall, they will all be served the same page, which kind of kills the statistics. Finally, I picked time-based balancing: if you visit the page on a second that is even, you get version A of the page, and if you visit the page on a second that is odd, you get version B. Simple, and so far nothing indicates there are more visitors on even seconds than on odd ones, or vice versa. The next thing is to always serve the same page to a returning visitor: if a visitor comes back later and gets a different version, that's cheating. So I decided the system should always serve the same page once a visitor has "picked" a version. That is simple enough to do: you just have to use a cookie to store the version the visitor has been attributed, and then use that cookie if they come back. Implementation To do that in Apache httpd, I used the powerful mod_rewrite module that ships with it. I put 2 files in the books directory, named "the-hacker-guide-to-python-a.html" and "the-hacker-guide-to-python-b.html", that got served when you requested "/books/the-hacker-guide-to-python".
RewriteEngine On
RewriteBase /books

# If there's a cookie called thgtp-pre-version set,
# use its value and serve the page
RewriteCond %{HTTP_COOKIE} thgtp-pre-version=([^;]+)
RewriteRule ^the-hacker-guide-to-python$ %{REQUEST_FILENAME}-%1.html [L]

# No cookie yet and
RewriteCond %{HTTP_COOKIE} !thgtp-pre-version=([^;]+)
# the number of seconds of the time right now is even
RewriteCond %{TIME_SEC} [02468]$
# Then serve the page A and store "a" in a cookie
RewriteRule ^the-hacker-guide-to-python$ %{REQUEST_FILENAME}-a.html [cookie=thgtp-pre-version:a:julien.danjou.info,L]

# No cookie yet and
RewriteCond %{HTTP_COOKIE} !thgtp-pre-version=([^;]+)
# the number of seconds of the time right now is odd
RewriteCond %{TIME_SEC} [13579]$
# Then serve the page B and store "b" in a cookie
RewriteRule ^the-hacker-guide-to-python$ %{REQUEST_FILENAME}-b.html [cookie=thgtp-pre-version:b:julien.danjou.info,L]

With those few lines, it worked flawlessly. Results The results were very good, in the sense that the mechanism worked perfectly. Combined with Google Analytics, I was able to follow the score of each page. It turns out that testing this particular little piece of content was, as expected, rather useless: the final scores did not allow me to pick any winner. Which also kind of proves that the system worked correctly.
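As an aside, if you wonder how to decide whether one version actually wins, a simple approach is a two-proportion z-test on the conversion rates. Here is a minimal sketch in Python; the visitor and subscriber counts are made-up numbers, not my actual figures.

import math

# Made-up example numbers, not actual figures from the experiment
visitors_a, subscribers_a = 1000, 130
visitors_b, subscribers_b = 1000, 118

rate_a = subscribers_a / float(visitors_a)
rate_b = subscribers_b / float(visitors_b)

# Pooled rate under the hypothesis that both versions convert equally
pooled = (subscribers_a + subscribers_b) / float(visitors_a + visitors_b)
stderr = math.sqrt(pooled * (1 - pooled)
                   * (1.0 / visitors_a + 1.0 / visitors_b))
z = (rate_a - rate_b) / stderr

print("A: %.1f%%  B: %.1f%%  z: %.2f" % (rate_a * 100, rate_b * 100, z))
# |z| > 1.96 roughly corresponds to 95% confidence
if abs(z) > 1.96:
    print("One version wins.")
else:
    print("No winner yet: keep collecting data or call it a tie.")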
But it was still an interesting challenge!

25 March 2014

Julien Danjou: The Hacker's Guide to Python released!

And done! It took me just 8 months to complete this entire book project around Python, from the first day I started writing to today, when I finally publish, and sell almost entirely by myself, this book. I'm really proud of what I've achieved so far, as this was something totally new to me. Doing all of this has been a great adventure, and I promise I'll write something about it later on. A making-of. For now, you can enjoy reading the book and learn a bit more about Python. I really hope it'll help you bring your Python-fu to a new level, and help you build great projects! Go check it out, and since this is the first day of sale, enjoy 20% off by using the offer code THGTP20.
